안현 한국정보통신학회 2023 Journal of information and communication convergen Vol.21 No.1
Various machine-learning models can yield high predictive power for massive time series. However, these models are prone to unstable computational cost because of the high dimensionality of the feature space and nonoptimized hyperparameter settings. Considering the risk that training a model on a high-dimensional feature set can be time-consuming, we evaluate a feature-importance-based feature selection method to derive a tradeoff between predictive power and computational cost for time series prediction. For the performance evaluation, we used two machine learning techniques to generate prediction models from a retail sales dataset. First, we ranked the features using impurity-based and Local Interpretable Model-agnostic Explanations (LIME)-based feature importance measures in the prediction models. Then, the recursive feature elimination method was applied to eliminate unimportant features sequentially. Consequently, we obtained a subset of features that could reduce model training time while preserving acceptable model performance.
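The elimination loop described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `importance_fn` is a hypothetical callback standing in for refitting a model and reading off impurity- or LIME-based scores, and the feature names and scores are invented toy values.

```python
from typing import Callable, Dict, List, Sequence


def recursive_feature_elimination(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    n_keep: int,
) -> List[List[str]]:
    """Drop the least important feature one round at a time.

    importance_fn is a hypothetical callback: given the current feature
    subset, it returns one importance score per feature (in practice it
    would refit the model on that subset first).
    Returns the feature subset after each round, starting with the full set.
    """
    if not 1 <= n_keep <= len(features):
        raise ValueError("n_keep must be between 1 and the number of features")
    current = list(features)
    history = [list(current)]
    while len(current) > n_keep:
        scores = importance_fn(current)
        # Remove the feature with the lowest importance in this round.
        weakest = min(current, key=lambda name: scores[name])
        current.remove(weakest)
        history.append(list(current))
    return history


# Toy stand-in for a refit model: fixed scores, invented for illustration.
TOY_SCORES = {"price": 0.50, "promo": 0.30, "weekday": 0.15, "store_id": 0.05}


def toy_importance(subset: List[str]) -> Dict[str, float]:
    return {name: TOY_SCORES[name] for name in subset}


subsets = recursive_feature_elimination(list(TOY_SCORES), toy_importance, n_keep=2)
for subset in subsets:
    print(subset)
```

In a real study, each subset in `subsets` would be paired with its measured training time and prediction error, which is where the tradeoff between cost and predictive power becomes visible.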
A Study on Malware Feature Refinement Based on a Stacked Autoencoder
김홍비(Hong-bi Kim),이태진(Tae-jin Lee) 한국정보보호학회 2020 정보보호학회논문지 Vol.30 No.4
The spread of malware generation tools that accompanied the growth of networks has caused the number of malware samples to increase exponentially, and existing detection methods are limited in their ability to respond. Machine-learning-based malware detection is developing in response. This paper studies a method that extracts features from the PE header and then uses an autoencoder to derive features that better represent malware, together with their feature importance. We extracted 549 features composed of information such as DLL/API data that can be identified in PE files, which are commonly used in malware analysis. To improve machine-learning detection performance, the autoencoder stores the data in compressed form and thereby extracts the features of the data effectively; the approach was shown to provide excellent accuracy while halving the processing time. The test results also proved useful for classifying malware groups, and future work will introduce classifiers such as SVM to pursue more accurate malware detection.
Seongryul Park,Seungjae Lee,Eunkyoung Park,Jongshill Lee,In Young Kim 대한의용생체공학회 2023 Biomedical Engineering Letters (BMEL) Vol.13 No.4
Pulse arrival time (PAT) and PPG morphological features have attracted much interest in cuffless blood pressure (BP) estimation, but their effects are not clearly understood when vascular characteristics are affected by diseases such as diabetes. This work quantitatively analyzes the effect of diabetic disease on the PAT and PPG morphological features-based BP estimation. We selected 112 diabetic patients and 308 non-diabetic subjects from VitalDB, and extracted 16 features including PAT, PPG morphological features, and heart rate. BP estimation performance was statistically compared between groups using linear regression models with several feature sets, and the relative importance of each feature in the optimal feature set was extracted. As a result, the standard deviation of the error and mean absolute error of PAT-based BP estimation were significantly higher in the diabetic group than in the non-diabetic group (p < 0.01). A feature set containing PAT and PPG morphological features achieved the best performance in both groups. However, the relative importance of each feature for BP estimation differed notably between groups. The results indicate that different features are important depending on the vascular characteristics, which could help to construct different models to accommodate specific diseases.
Performance Comparison of Machine Learning Models for Total Organic Carbon (TOC) Prediction According to Input Variable Composition
이소현,박정수 유기성자원학회 2024 유기물자원화 Vol.32 No.3
Total organic carbon (TOC) represents the total amount of organic carbon contained in water and is a key water quality parameter used, along with biochemical oxygen demand (BOD) and chemical oxygen demand (COD), to quantify the amount of organic matter in water. In this study, a model to predict TOC was developed using XGBoost (XGB), a representative ensemble machine learning algorithm. Independent variables for model construction included water temperature, pH, electrical conductivity, dissolved oxygen concentration, BOD, COD, suspended solids, total nitrogen, total phosphorus, and discharge. To quantitatively analyze the impact of the water quality parameters used in model construction, the feature importance of the input variables was calculated. Based on the feature importance results, items with low importance were sequentially excluded to observe changes in model performance. Models built by sequentially excluding low-importance items showed a root mean squared error-observation standard deviation ratio (RSR) in the range of 0.53 to 0.55, and the model that applied all input variables showed the best performance, with an RSR of 0.53. To enhance the model's field applicability, models using parameters that are relatively easy to measure in the field were also built and their performance was analyzed. A model constructed using only water temperature, pH, electrical conductivity, dissolved oxygen concentration, and suspended solids had an RSR of 0.72, which indicates that stable performance may be achievable using only field water quality parameters that are relatively easy to measure.
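For readers unfamiliar with the RSR metric reported above, the following sketch computes it as the root mean squared error divided by the standard deviation of the observations, which is the usual definition. The observed and predicted values are invented toy numbers, not data from the study.

```python
import math
from typing import Sequence


def rsr(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE-observation standard deviation ratio (lower is better, 0 is perfect)."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    # Sum of squared prediction errors (numerator of RMSE before scaling).
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Sum of squared deviations of the observations from their mean.
    sq_dev = sum((o - mean_obs) ** 2 for o in observed)
    if sq_dev == 0:
        raise ValueError("observations have zero variance")
    # The 1/n factors of RMSE and the standard deviation cancel.
    return math.sqrt(sq_err) / math.sqrt(sq_dev)


# Toy TOC-like values, invented for illustration only.
toy_observed = [2.0, 4.0, 6.0, 8.0]
toy_predicted = [3.0, 4.0, 6.0, 7.0]
toy_rsr = rsr(toy_observed, toy_predicted)
print(round(toy_rsr, 3))
```

Because RSR normalizes the error by the spread of the observations, values from models fitted to different input sets (such as 0.53 and 0.72 above) can be compared directly on the same target.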
A Study on a Growth Prediction Model for Small and Medium-Sized Enterprises Using Big Data Analysis Techniques
모혜란,김현경,김 현 대한전자공학회 2023 전자공학회논문지 Vol.60 No.3
This paper introduces a model that predicts the future growth potential of small and medium-sized enterprises (SMEs) based on corporate big data analysis. Financial data is the most important variable related to corporate growth and has been used frequently in previous studies to predict growth potential. In this paper, however, both a company's financial data and its stock price are used as the main variables for predicting growth potential. The main variables related to corporate growth were selected on the basis of Feature Importance, and the K-Means algorithm was used to confirm that a company's financial position and stock price are related to each other. This is because various indicators, such as the possibility of market entry or expansion, technological advantage and differentiation, expertise, management ability, growth, profitability, and stability, are reflected in the company's stock price. Using PCA and Feature Importance, we propose a model that can predict a company's growth potential.
Bin Cheng,Dingjie Guan,Bingxue Jing 한국정밀공학회 2022 International Journal of Precision Engineering and Vol.23 No.2
Small and medium-sized manufacturing enterprises produce many customized products. Adaptability therefore needs attention when raising the level of digitalization and intelligence in product design and manufacturing. This paper presents a process sequencing method for manufacturing features based on the node importance of a complex network. The method uses an adjacency matrix and a connected graph to analyze the process constraint semantics of the product model. The adjacency matrix expresses the positioning dimensions between features. The connected graph is applied to define the constraint relationships between features and to aggregate the multi-dimensional process dimension chain in all directions. By sequencing the processes according to node importance in the complex network, most of the process planning can be realized. The method can also make adaptive decisions for different structural parts and monitor the machining of key features. Examples verify the validity and feasibility of the proposed method.
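As a rough illustration of ranking manufacturing features by node importance, the sketch below uses plain degree centrality on an adjacency matrix. This is a swapped-in stand-in: the abstract does not state which importance measure the paper uses, and the feature names and adjacency matrix here are hypothetical.

```python
from typing import Dict, List


def degree_importance(adjacency: List[List[int]], names: List[str]) -> Dict[str, int]:
    """Score each node by its degree (number of nonzero entries in its row)."""
    if len(adjacency) != len(names) or any(len(row) != len(names) for row in adjacency):
        raise ValueError("adjacency must be a square matrix matching names")
    return {name: sum(1 for v in row if v) for name, row in zip(names, adjacency)}


def order_by_importance(importance: Dict[str, int]) -> List[str]:
    """Most important node first; ties are broken alphabetically."""
    return sorted(importance, key=lambda name: (-importance[name], name))


# Hypothetical part with four features; 1 marks a positioning dimension
# shared between two features (symmetric, no self-links).
feature_names = ["base_plane", "hole_A", "slot_B", "pocket_C"]
adjacency_matrix = [
    [0, 1, 1, 1],  # base_plane is referenced by every other feature
    [1, 0, 1, 0],  # hole_A
    [1, 1, 0, 0],  # slot_B
    [1, 0, 0, 0],  # pocket_C
]

importance = degree_importance(adjacency_matrix, feature_names)
sequence = order_by_importance(importance)
print(sequence)
```

A feature that many others are dimensioned against (here `base_plane`) ends up first in the sequence, which matches the intuition that datum features should be machined before the features positioned from them.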
Importance Analysis of Factors Affecting Ship Berthing Velocity Using Machine Learning Classification Algorithms
이형탁,이상원,조장원,조익순 해양환경안전학회 2020 해양환경안전학회지 Vol.26 No.2
The most important factor affecting the berthing energy generated when a ship berths is the berthing velocity, and an accident may occur if the berthing velocity is excessive. Several factors influence the berthing velocity, but previous studies have mostly limited their analysis to the size of the vessel. Therefore, the aim of this study is to analyze the various factors that influence berthing velocity and determine their respective importance. The analysis was based on berthing velocities measured at a tanker jetty in Korea. Using the collected data, the machine learning classification algorithms decision tree, random forest, logistic regression, and perceptron were compared and analyzed. The algorithms were evaluated with performance indexes derived from the confusion matrix. The perceptron demonstrated the best performance, and the resulting feature importance was in the following order: ship size (DWT), jetty number, and loading state. Hence, when berthing a ship, the berthing velocity should be determined in consideration of various factors, such as the size of the ship, the position of the jetty, and the loading condition of the cargo.
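The confusion-matrix indexes mentioned above can be computed as in the following minimal sketch. It assumes a binary labeling for simplicity (the abstract does not say how the berthing velocity classes were defined), and the labels are invented toy values.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for a binary problem."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1; undefined ratios fall back to 0.0."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (1 = hypothetical "high berthing velocity" class), for illustration.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(toy_true, toy_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Comparing several classifiers then amounts to computing these indexes for each model's predictions on the same held-out data.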
박사윤,Musun Park,Won-Yung Lee,Choong-Yeol Lee,Ji-Hwan Kim,이시우,김창업 한국한의학연구원 2021 Integrative Medicine Research Vol.10 No.3
Background: Despite the importance of accurate diagnosis of Sasang type, a concept unique to Korean medicine, there have been concerns about consistency among diagnoses. We investigate a data-driven integrative diagnostic model by applying machine learning to a multicenter clinical dataset with comprehensive features. Methods: Extremely randomized trees (ERT), support vector machines, multinomial logistic regression, and K-nearest neighbor were applied, and performances were evaluated by cross-validation. The feature importance of the classifier was analyzed to understand which information is crucial in diagnosis. Results: The ERT classifier showed the highest performance, with an overall f1 score of 0.60 ± 0.060. The feature classes of body measurement, personality, general information, and cold–heat were more decisive than others in classifying Sasang types. Costal angle was the most informative feature. In pairwise classification, we found Sasang type-dependent distinctions: body measurement features played a key role in the TE-SE and TE-SY datasets, while personality and cold–heat features were important in the SE-SY dataset. Conclusion: The current study investigated a comprehensive diagnostic model for Sasang type using machine learning and achieved better performance than previous studies. This study helps data-driven decision making in clinics by revealing key features contributing to the Sasang type diagnosis.
Optimization of a Throughput Prediction Model Based on Category Feature Contribution in 4G/5G Network Environments
신재영,박지현 한국정보과학회 2024 정보과학회논문지 Vol.51 No.11
The adoption of 5G technology has accelerated because of increased network data consumption and the limitations of 4G, which has led to a heterogeneous network environment comprising both 4G and limited 5G. This has highlighted the importance of throughput prediction for network quality of service (QoS) and resource optimization. Existing throughput prediction research mainly uses a single attribute or selects attributes through correlation analysis. However, these approaches have limitations, including the possible exclusion of variables with nonlinear relationships and the arbitrariness and inconsistency of correlation coefficient thresholds. To overcome these limitations, this paper proposes a new approach based on Feature Importance. The method calculates the relative importance of the features used in the network, assigns contribution scores to attribute categories, and then uses these scores to predict throughput. The approach was applied to four open network datasets. The experiments derived an optimal category combination for throughput prediction, which reduced model complexity and improved prediction accuracy compared with using all categories.
박정식,한호 한국영어학회 2022 영어학 Vol.22 No.-
Tale types of “Little Red Riding Hood” have survived through oral transmission in various areas including Europe, Africa, and Asia and can even be traced back to the 10th century in written form. This research presents quantitative analyses of the folkloric landscape of tales of, or related to, what is best known as Little Red Riding Hood through the Aarne-Thompson-Uther (ATU) index, of which we analyzed ATU 333, ATU 123, and other unspecified types, based on logistic regression and decision trees. The quantitative analyses of the Little Red Riding Hood tale types indicate that ATU 123 alone has specific story segments that are important to the formation of the tale type and that, though diversified in story segments and other details, the three types share a distinct plot sequence as an important feature. In addition, eight event descriptors and six character and setting descriptors are found to be meaningful factors in the formation of ATU 123. It can be further argued that the plot as an abstraction played a major role in the formation of the tales we have now. This paper also demonstrates that researchers can gain substantial insights from the quantitative results by cross-checking them with qualitative analyses.