RISS Academic Research Information Service

      • A Study on Imbalanced Data Stream Processing Using a Mass Function

        Su-Hee Kim, Dong-Hyok Suh  Security Engineering Research Support Center  2015  International Journal of Software Engineering and Vol.9 No.11

        In the IoT environment, a sensor data stream consists of event data from heterogeneous multi-sensors. One type of sensor may have quite a different event frequency from those of other kinds of sensors, which makes most sensor data sets imbalanced. To classify imbalanced data effectively, it is necessary to preprocess it into balanced data. This process may unify heterogeneous attributes in the imbalanced data and alleviate the difficulties of data mining on it. The mass function plays an important role in fuzzy theory and Dempster-Shafer theory. In this paper, the use of a mass function is suggested for processing an imbalanced data stream. A mass function is developed to compute mass values for imbalanced data sets, and an experiment is performed to investigate the validity of applying the mass function to the sensor data stream.
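        The abstract does not give the paper's exact mass function, so the sketch below is only an illustrative stand-in: each event type receives a mass proportional to the inverse of its observed frequency, normalized so the masses sum to one, which up-weights rare (minority) sensor events.

```python
# Illustrative stand-in only: inverse-frequency masses, normalized to sum to one,
# so that rare (minority) sensor events receive larger weight.
from collections import Counter

def mass_from_stream(events):
    """events: iterable of event-type labels drawn from a heterogeneous sensor stream."""
    counts = Counter(events)
    inverse = {e: 1.0 / c for e, c in counts.items()}   # rare events -> larger raw weight
    total = sum(inverse.values())
    return {e: w / total for e, w in inverse.items()}   # normalized mass values

# Example: a stream dominated by one event type.
stream = ["door"] * 95 + ["smoke"] * 5
print(mass_from_stream(stream))   # {'door': ~0.05, 'smoke': ~0.95}
```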

      • KCI-registered

        A Hybrid SVM Model for the Classification of Imbalanced Data Sets

        Jae Sik Lee, Jong Gu Kwon  Korea Intelligent Information Systems Society  2013  Journal of Intelligence and Information Systems Vol.19 No.2

        We call a data set in which the number of records belonging to a certain class far outnumbers the number of records belonging to the other class an ‘imbalanced data set’. Most classification techniques perform poorly on imbalanced data sets. When we evaluate the performance of a certain classification technique, we need to measure not only ‘accuracy’ but also ‘sensitivity’ and ‘specificity’. In a customer churn prediction problem, ‘retention’ records account for the majority class, and ‘churn’ records account for the minority class. Sensitivity measures the proportion of actual retentions which are correctly identified as such. Specificity measures the proportion of churns which are correctly identified as such. The poor performance of classification techniques on imbalanced data sets is due to the low value of specificity. Many previous studies on imbalanced data sets employed an ‘oversampling’ technique, where members of the minority class are sampled more than those of the majority class in order to make a relatively balanced data set. When a classification model is constructed using this oversampled balanced data set, specificity can be improved but sensitivity will be decreased. In this research, we developed a hybrid model of support vector machine (SVM), artificial neural network (ANN), and decision tree that improves specificity while maintaining sensitivity. We named this hybrid model the ‘hybrid SVM model.’ The process of construction and prediction of our hybrid SVM model is as follows. By oversampling from the original imbalanced data set, a balanced data set is prepared. SVM_I model and ANN_I model are constructed using the imbalanced data set, and SVM_B model is constructed using the balanced data set. SVM_I model is superior in sensitivity and SVM_B model is superior in specificity. For a record on which both SVM_I model and SVM_B model make the same prediction, that prediction becomes the final solution. If they make different predictions, the final solution is determined by the discrimination rules obtained by ANN and decision tree. For a record on which SVM_I model and SVM_B model make different predictions, a decision tree model is constructed using the ANN_I output value as input and actual retention or churn as target. We obtained the following two discrimination rules: ‘IF ANN_I output value <0.285, THEN Final Solution = Retention’ and ‘IF ANN_I output value ≥0.285, THEN Final Solution = Churn.’ The threshold 0.285 is the value optimized for the data used in this research. The result we present in this research is the structure or framework of our hybrid SVM model, not a specific threshold value such as 0.285. Therefore, the threshold value in the above discrimination rules can be changed to any value depending on the data. In order to evaluate the performance of our hybrid SVM model, we used the ‘churn data set’ in the UCI Machine Learning Repository, which consists of 85% retention customers and 15% churn customers. Accuracy of the hybrid SVM model is 91.8%, which is better than that of SVM_I model or SVM_B model. The points worth noticing here are its sensitivity, 95.02%, and specificity, 69.24%. The sensitivity of SVM_I model is 94.65%, and the specificity of SVM_B model is 67.00%. Therefore, the hybrid SVM model developed in this research improves the specificity of SVM_B model while maintaining the sensitivity of SVM_I model.
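        A minimal sketch of the agreement/disagreement logic described above, assuming scikit-learn, an oversampled balanced copy of the training set prepared beforehand, and churn encoded as class 1; the decision-tree stage is reduced to the single quoted threshold rule, and 0.285 is kept only as a default parameter, not a recommended value.

```python
# Sketch of the hybrid decision scheme under the assumptions stated above.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def fit_hybrid(X_imb, y_imb, X_bal, y_bal):
    svm_i = SVC().fit(X_imb, y_imb)                          # imbalanced set: high sensitivity
    svm_b = SVC().fit(X_bal, y_bal)                          # balanced set: high specificity
    ann_i = MLPClassifier(max_iter=1000).fit(X_imb, y_imb)   # ANN_I
    return svm_i, svm_b, ann_i

def predict_hybrid(models, X, threshold=0.285, retention=0, churn=1):
    svm_i, svm_b, ann_i = models
    p_i, p_b = svm_i.predict(X), svm_b.predict(X)
    ann_out = ann_i.predict_proba(X)[:, 1]                   # ANN_I output value (prob. of churn)
    # agreement -> accept the shared prediction; disagreement -> threshold rule
    return np.where(p_i == p_b, p_i,
                    np.where(ann_out < threshold, retention, churn))
```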

      • KCI-registered

        An Intelligent Fault Diagnosis Method for Imbalanced Nuclear Power Plant Data Based on Generative Adversarial Networks

        Dai Yuntao, Peng Lizhang, Juan Zhaobo, Liang Yuan, Shen Jihong, Wang Shujuan, Tan Sichao, Yu Hongyan, Sun Mingze  The Korean Institute of Electrical Engineers  2023  Journal of Electrical Engineering & Technology Vol.18 No.4

        In the fault diagnosis problem, where sample data of fault cases are imbalanced, data generation and expansion are performed based on a generative adversarial network to obtain balanced data for training. Combining a gated recurrent neural network and an autoencoder model, the GRU-BEGAN model for generating multiple time series data is proposed for the intelligent fault diagnosis of imbalanced nuclear power plant data. To guarantee the consistency of the probability distribution between the generated data and real data, K-L losses are included as part of the loss function of the generator. At the same time, the latent feature vector of the real data obtained by the discriminator encoder is introduced as a hidden variable in the generator, and the similarity between the generated data and the real data is controlled by introducing the hidden variables probabilistically so as to keep the generated data diverse. For the imbalanced fault dataset of the nuclear power plant thermal–hydraulic systems, the proposed GRU-BEGAN model is used to expand the original data to obtain a balanced state. Then, a 1D-CNN fault diagnosis model is established based on a convolutional neural network. The experimental results show that the fault diagnosis accuracy on the total test data is improved by 1.45% after data expansion, and the fault diagnosis accuracy on the minority-class samples is improved by 6.8% after data expansion.
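        A rough sketch, assuming PyTorch, of the two ingredients highlighted above: a GRU-based generator for multivariate time series and a K-L term added to the adversarial generator loss. The K-L term here compares softmax-normalized summaries of the generated and real batches; it is only a stand-in, since the abstract does not give the paper's exact formulation.

```python
# Not the authors' GRU-BEGAN; a simplified illustration of the loss structure.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUGenerator(nn.Module):
    def __init__(self, latent_dim=16, hidden_dim=64, n_features=8):
        super().__init__()
        self.gru = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_features)

    def forward(self, z):                 # z: (batch, seq_len, latent_dim)
        h, _ = self.gru(z)
        return self.out(h)                # (batch, seq_len, n_features)

def generator_loss(adv_loss, fake, real, kl_weight=0.1):
    """Adversarial loss plus a K-L term pulling generated values toward the real distribution."""
    p = F.log_softmax(fake.flatten(0, 1), dim=0)   # generated, per-feature log-distribution
    q = F.softmax(real.flatten(0, 1), dim=0)       # real, per-feature distribution
    return adv_loss + kl_weight * F.kl_div(p, q, reduction="batchmean")

# Usage: four 50-step latent sequences -> four generated multivariate series.
G = GRUGenerator()
fake = G(torch.randn(4, 50, 16))
```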

      • KCI-registered

        Heterogeneous Ensemble of Classifiers from Under-Sampled and Over-Sampled Data for Imbalanced Data

        Dae-Ki Kang, Min-gyu Han  The Institute of Internet, Broadcasting and Communication  2019  Journal of Advanced Smart Convergence Vol.8 No.1

        The data imbalance problem is common and causes serious problems in the machine learning process. Sampling is one of the effective methods for solving the data imbalance problem. Over-sampling increases the number of instances, so when over-sampling is applied to imbalanced data, it is applied to the minority instances. Under-sampling reduces the number of instances and is usually performed on the majority class. We apply under-sampling and over-sampling to imbalanced data and generate sampled data sets. From the data sets generated by sampling and the original data set, we construct a heterogeneous ensemble of classifiers, applying five different algorithms to the ensemble. Experimental results on an intrusion detection dataset, which is imbalanced, show that our approach is effective.
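        A minimal sketch of this scheme, assuming scikit-learn, imbalanced-learn, and 0/1 integer class labels: three of the five algorithms are shown, each fitted on a differently sampled view of the data, and their predictions are combined by majority vote.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def fit_heterogeneous_ensemble(X, y):
    X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
    X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
    return [DecisionTreeClassifier().fit(X, y),                       # original data
            LogisticRegression(max_iter=1000).fit(X_over, y_over),    # over-sampled data
            GaussianNB().fit(X_under, y_under)]                       # under-sampled data

def predict_majority(models, X):
    votes = np.stack([m.predict(X) for m in models])                  # (n_models, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```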

      • Improved Generation of Artificial Imbalanced-Class Data Considering the Characteristics of Real Data

        Eun Jin Kim, Uk Heo, Byoung Chul Kim, Il-Kyu Eom, Young In Kim  Korean Institute of Information Technology  2011  Proceedings of KIIT Conference Vol.2011 No.5

        One difficulty in imbalanced-class classification problems is obtaining real data from various fields, so generating synthetic data to replace real data has been studied actively. However, there is a lack of research considering the degree of overlap among classes and the various distributions occurring in real-world situations. As a preliminary study, we propose new synthetic data that include two characteristics that strongly affect classification performance: a normal distribution and overlap among classes. The proposed data therefore include characteristics of real data that are insufficiently covered in previous research. Experimental results show that its classification performance is closer to that of real data than the existing synthetic data.
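        A small sketch, assuming NumPy, of how such synthetic data could be generated: two normally distributed classes at a 9:1 ratio, with the class overlap controlled by how far apart the class means are placed.

```python
import numpy as np

def make_imbalanced_gaussians(n_major=900, n_minor=100, mean_shift=1.5, dim=2, seed=0):
    rng = np.random.default_rng(seed)
    major = rng.normal(loc=0.0, scale=1.0, size=(n_major, dim))         # majority class
    minor = rng.normal(loc=mean_shift, scale=1.0, size=(n_minor, dim))  # minority class
    X = np.vstack([major, minor])
    y = np.array([0] * n_major + [1] * n_minor)
    return X, y

# Smaller mean_shift -> more overlap between the classes, a harder problem.
X, y = make_imbalanced_gaussians(mean_shift=1.0)
```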

      • KCI-registered

        A Deep Learning-Based Oversampling Technique for Imbalanced Data Classification

        손민재, 정승원, 황인준  Korea Information Processing Society  2019  KIPS Transactions on Software and Data Engineering Vol.8 No.7

        A classification problem is to predict the class to which an input data point belongs. One of the most popular methods to do this is training a machine learning algorithm using the given dataset. In this case, the dataset should have a well-balanced class distribution for the best performance. However, when the dataset has an imbalanced class distribution, its classification performance can be very poor. To overcome this problem, we propose an over-sampling scheme that balances the number of data points by using Conditional Generative Adversarial Networks (CGAN). CGAN is a generative model developed from Generative Adversarial Networks (GAN), which can learn data characteristics and generate data that is similar to real data. Therefore, CGAN can generate data for a class that has only a small number of samples, so that the problem induced by an imbalanced class distribution can be mitigated and classification performance can be improved. Experiments using actually collected data show that the over-sampling technique using CGAN is effective and superior to existing over-sampling techniques.
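        A minimal sketch of CGAN-based oversampling for tabular data, assuming PyTorch; the dimensions, layer sizes, and training schedule are placeholders, not the authors' architecture. The class label is fed to both networks as a one-hot condition, and after training the generator is sampled with the minority label to synthesize additional minority-class rows.

```python
# X: FloatTensor (n, N_FEATURES), y: LongTensor (n,) with class indices.
import torch
import torch.nn as nn

N_FEATURES, N_CLASSES, NOISE_DIM = 10, 2, 16      # assumed problem dimensions

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(NOISE_DIM + N_CLASSES, 64), nn.ReLU(),
                                 nn.Linear(64, N_FEATURES))
    def forward(self, z, y_onehot):
        return self.net(torch.cat([z, y_onehot], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_FEATURES + N_CLASSES, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, x, y_onehot):
        return self.net(torch.cat([x, y_onehot], dim=1))

def train_cgan(X, y, epochs=200):
    G, D = Generator(), Discriminator()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()
    y_onehot = torch.eye(N_CLASSES)[y]
    real_t, fake_t = torch.ones(len(X), 1), torch.zeros(len(X), 1)
    for _ in range(epochs):
        # discriminator step: real rows vs. generated rows under the same labels
        fake = G(torch.randn(len(X), NOISE_DIM), y_onehot).detach()
        d_loss = bce(D(X, y_onehot), real_t) + bce(D(fake, y_onehot), fake_t)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # generator step: try to fool the discriminator
        g_out = D(G(torch.randn(len(X), NOISE_DIM), y_onehot), y_onehot)
        g_loss = bce(g_out, real_t)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return G

def oversample_minority(G, minority_label, n_samples):
    y_onehot = torch.eye(N_CLASSES)[torch.full((n_samples,), minority_label)]
    return G(torch.randn(n_samples, NOISE_DIM), y_onehot).detach()
```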

      • KCI-registered

        A Study on Performance Comparison and Approach Strategies of Classification Methods for Imbalanced Data

        Byung Joo Yoo  Korean Data Analysis Society  2021  Journal of the Korean Data Analysis Society Vol.23 No.1

        In order to perform a classification analysis on imbalanced data, we are faced with two choices. One is the selection of a model for classification analysis, and the other is the selection of a method to solve the imbalance problem. Therefore, this paper deals with a sequential approach strategy for imbalanced data, taking into account characteristics of the data such as the size of the training sample, the number of independent variables, and the degree of imbalance. A simulation is conducted to compare the logistic regression model, support vector machine, and deep learning, which are representative models used for binary classification analysis, in terms of classification performance according to the characteristics of the data. In addition, a simulation was performed to identify, through Tukey's multiple comparison, the approach strategy that gives the best classification performance when each model is combined with the methods for resolving the imbalance problem. As a result of the simulation, it was confirmed that the size of the training sample and the presence of imbalance are the dominant factors among the data characteristics. When the training sample is small, the logistic regression model combined with the over-sampling method is the best strategy for handling the data imbalance problem. When the training sample is large, deep learning combined with the weighting method or the under-sampling method was confirmed to be the approach strategy that yields superior estimation results.
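        A compact sketch, assuming scikit-learn and imbalanced-learn, of the kind of model-by-remedy grid compared above; balanced accuracy is used as a simple stand-in score, and the weighting remedy is omitted for brevity.

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score

def compare_strategies(X_train, y_train, X_test, y_test):
    models = {"logistic": LogisticRegression(max_iter=1000),
              "svm": SVC(),
              "deep": MLPClassifier(max_iter=1000)}
    remedies = {"none": None,
                "over": RandomOverSampler(random_state=0),
                "under": RandomUnderSampler(random_state=0)}
    scores = {}
    for m_name, model in models.items():
        for r_name, sampler in remedies.items():
            Xr, yr = (X_train, y_train) if sampler is None else sampler.fit_resample(X_train, y_train)
            pred = model.fit(Xr, yr).predict(X_test)
            scores[(m_name, r_name)] = balanced_accuracy_score(y_test, pred)
    return scores
```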

      • Imbalanced Data Stream Mining for Context Inference

        Su-Hee Kim, Dong-Hyok Suh  Korea Institute of Information and Communication Engineering  2015  2016 INTERNATIONAL CONFERENCE Vol.7 No.1

        A sensor data stream is imbalanced in many cases. To classify imbalanced data effectively, it is necessary to preprocess it into balanced data. This process may unify heterogeneous attributes in the imbalanced data and alleviate the difficulties of data mining on it. In this paper, a mass function is developed, and an experiment is performed to investigate the validity of applying the mass function to the sensor data stream.
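        As background for the context-inference step (not taken from the paper), once per-sensor mass assignments are available they can be fused with Dempster's rule of combination; the sketch below assumes hypotheses represented as frozensets and masses that sum to one.

```python
from itertools import product

def combine_dempster(m1, m2):
    """m1, m2: dicts mapping frozenset hypotheses to mass values summing to 1."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb                    # mass falling on the empty set
    return {h: v / (1.0 - conflict) for h, v in combined.items()}

# Example: two sensors give partially conflicting evidence about the context.
motion = {frozenset({"occupied"}): 0.7, frozenset({"occupied", "empty"}): 0.3}
sound  = {frozenset({"occupied"}): 0.6, frozenset({"empty"}): 0.4}
print(combine_dempster(motion, sound))             # {'occupied'}: ~0.83, {'empty'}: ~0.17
```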
