Si-Hwan Jang, Joon Lee, Jae-Hyeon Eom, Sung-Soo Kim Institute of Industrial Technology, Kangwon National University 2024 Journal of Industrial Technology (산업기술연구) Vol.44 No.1
K-means is a popular and efficient data clustering method and one of the most important techniques in data mining. However, K-means is sensitive to initialization and, because it is a hill-climbing method, can become stuck in a local optimum. A robust K-means (RK-means) is therefore needed, both to reduce this possibility and to increase the probability of finding the globally optimal clustering solution. The objective of this paper is to propose RK-means, which starts from the best initial solution chosen among good solutions that have good central data for each cluster. The central data of each cluster is selected by roulette-wheel probabilistic selection using the sum of the relative distance rate of each data point. Methods that deterministically select the central data for just one initial solution of K-medoid have a problem with high-density data. The proposed initial solution is a good starting point from which K-means can find a robust solution while reducing the possibility of being stuck in local optima. The performance of the proposed RK-means is validated against the original K-means by experiment and analysis on machine learning repository datasets (Iris, Wine, Glass, Vowel, Cloud). Our simulation shows that RK-means using the probabilistic relative distance rate performs better than K-means with random initialization: the minimum squared distance obtained by RK-means is lower, and its deviation smaller, than that obtained by K-means. RK-means is also competitive with data clustering methods based on simulated annealing (SA) and with hybrid K-means with SA (KSA and KSAK).
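The roulette-wheel selection of central data mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not define the relative distance rate precisely, so the code below assumes one plausible reading: each point's rate is its share of the summed pairwise distances, and points with a smaller rate (more central points) receive a larger selection weight. The function names are hypothetical and this is not the authors' implementation.

```python
import math
import random


def relative_distance_rates(points):
    """Share of the total summed distance attributable to each point (assumed definition)."""
    sums = [sum(math.dist(p, q) for q in points) for p in points]
    total = sum(sums)
    if total == 0:  # all points identical: treat them as equally central
        return [1.0 / len(points)] * len(points)
    return [s / total for s in sums]


def roulette_pick(weights, rng):
    """Return an index with probability proportional to its weight."""
    threshold = rng.random() * sum(weights)
    running = 0.0
    for index, weight in enumerate(weights):
        running += weight
        if threshold < running:
            return index
    return len(weights) - 1  # guard against floating-point round-off


def pick_initial_centers(points, k, rng):
    """Pick k distinct points as initial centers by repeated roulette spins."""
    if k > len(points):
        raise ValueError("k must not exceed the number of points")
    rates = relative_distance_rates(points)
    # Assumption: a small rate means a central point, so weight by the inverse.
    weights = [1.0 / (rate + 1e-12) for rate in rates]
    chosen = []
    for _ in range(k):
        index = roulette_pick(weights, rng)
        chosen.append(index)
        weights[index] = 0.0  # a chosen point cannot be drawn again
    return [points[i] for i in chosen]
```

Calling `pick_initial_centers(points, k, random.Random(seed))` several times with different seeds would give several candidate initial solutions, from which the best could be kept as the starting point for K-means.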
Clustering using K-Means and Fuzzy C-Means on Food Productivity
Adriyendi Science and Engineering Research Support Society (SERSC) 2016 International Journal of u- and e- Service, Scienc Vol.9 No.12
This paper provides an overview of the analysis and implementation of clustering for food productivity. Food productivity is determined by food production. Rice is one of the staple foods in Indonesia, and rice production influences the adequacy of national food production. Rice productivity is very important for achieving food affordability. Rice productivity per province in Indonesia must be increased because of the large population and high consumption. Because rice productivity fluctuates and tends to decrease, clustering is needed to determine the productivity category of each cluster. Clustering is performed using K-Means and Fuzzy C-Means. K-Means is improved by modifying the Intra Cluster Distance and Inter Cluster Distance. These distances are calculated to evaluate the clustering results and to compare the efficiency of the clustering algorithms. Fuzzy C-Means is improved by modifying the algorithm, the alternative process, and the iteration. Data processing uses Excel. The clustering converges to three clusters (C1, C2, C3). The clusters are measured by comparing cluster membership, consistency, and productivity. In cluster membership, there are anomalous data points (x22, x23, x29, x33). Data consistency for K-Means is C1 = 72.73%, C2 = 93.75%, C3 = 100%; for Fuzzy C-Means it is C1 = 100%, C2 = 88.33%, C3 = 87.50%. Rice productivity is decreasing in Cluster 1, decreasing in Cluster 2 (except for 3 provinces), and increasing in Cluster 3 (except for 1 province). The majority share in rice productivity is 70.59%. The clustering results show that the majority of rice productivity falls in the low-productivity cluster category.
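The intra-cluster and inter-cluster distances used here for evaluation can be sketched as follows. The abstract does not give the exact formulas, so this sketch assumes two common definitions: intra-cluster distance as the mean distance from each point to its own cluster centroid, and inter-cluster distance as the smallest distance between two centroids. The function names are illustrative only.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def intra_cluster_distance(clusters):
    """Mean distance from each point to the centroid of its own cluster."""
    total, count = 0.0, 0
    for points in clusters:
        center = centroid(points)
        total += sum(math.dist(p, center) for p in points)
        count += len(points)
    return total / count


def inter_cluster_distance(clusters):
    """Smallest distance between any two cluster centroids."""
    centers = [centroid(points) for points in clusters]
    return min(
        math.dist(a, b)
        for i, a in enumerate(centers)
        for b in centers[i + 1:]
    )
```

Under these definitions, a good clustering has a small intra-cluster distance and a large inter-cluster distance, so the two values can be compared across algorithms run on the same data.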
Approximate k values using Repulsive Force without Domain Knowledge in k-means
Jung-jae Kim, Minwoo Ryu, Si-ho Cha Korean Society for Internet Information (KSII) 2020 KSII Transactions on Internet and Information Syst Vol.14 No.3
The k-means algorithm is widely used in academia and industry because it is simple and easy to implement, enabling fast learning on complex datasets. However, k-means struggles to classify datasets without prior knowledge of the specific domain. In a previous study we proposed the repulsive k-means (RK-means) algorithm, which improves k-means by using the concept of repulsive force to delete unnecessary cluster centroids. RK-means can therefore classify a dataset without domain knowledge. However, three main problems remain. The RK-means algorithm includes a cluster repulsive force offset for clusters confined within other clusters, which can cause cluster locking; we were unable to prove in the previous study that RK-means converges optimally; and RK-means showed better performance only for particular normalization term and weight settings. This paper therefore proposes the advanced RK-means (ARK-means) algorithm to resolve these problems. We establish an initialization strategy for deploying cluster centroids and define a metric for the ARK-means algorithm. Finally, we redefine the mass and normalization terms so that they fit general datasets more closely. We show the feasibility of ARK-means experimentally using blob and iris datasets. The experimental results verify that the proposed ARK-means algorithm performs better than k-means, k’-means, and RK-means.
An Improved K-means Algorithm based on Mapreduce and Grid
Li Ma, Lei Gu, Bo Li, Yue Ma, Jin Wang Science and Engineering Research Support Society (SERSC) 2015 International Journal of Grid and Distributed Comp Vol.8 No.1
In the traditional K-means clustering algorithm, the number of clusters K is difficult to determine in advance, and the initial cluster centers are selected randomly, which makes the clustering results very unstable. The algorithm is also susceptible to noise points. To solve these problems, the traditional K-means algorithm is improved. In the improved method, the data space is divided into equal-sized grid cells, and each data point is assigned to the corresponding cell according to its attribute values. The number of data points in each cell is then counted. The M (M > K) cells containing the most data points are selected, and the central point of each is calculated. These M central points serve as input data, and the K value is determined from the clustering results. Among the M points, the K points farthest from each other are found and used as the initial cluster centers of the K-means clustering algorithm. At the same time, the maximum among the M points must be included in the K. If the number of data points in a cell is less than a threshold, those points are considered noise and removed. So that the improved algorithm can handle large data, we parallelize it and combine it with the MapReduce framework. Theoretical analysis and experimental results show that, compared with the traditional K-means clustering algorithm, the improved algorithm produces higher-quality results, needs fewer iterations, and has good stability. The parallelized algorithm is highly efficient in data processing and has good scalability and speedup.
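The grid-based seeding described in this abstract can be sketched on a single machine as follows. Several details are assumptions rather than the authors' method: cells are formed by integer-dividing coordinates by a hypothetical `cell_size`, the "maximum among the M points" is read as the center of the most populated cell, the "K points farthest from each other" are approximated greedily (farthest-first), and the MapReduce parallelization and the automatic choice of K are left out.

```python
import math
from collections import defaultdict


def grid_initial_centers(points, k, m, cell_size, noise_threshold):
    """Grid-density seeding: returns k initial centers for K-means."""
    # Assign every point to a grid cell by integer-dividing its coordinates.
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell_size) for x in p)
        cells[key].append(p)

    # Drop sparse cells as noise, then keep the m most populated cells.
    dense = [c for c in cells.values() if len(c) >= noise_threshold]
    dense.sort(key=len, reverse=True)
    top = dense[:m]

    # Central point of a cell = mean of the points that fell into it.
    dims = len(points[0])
    candidates = [
        tuple(sum(p[d] for p in cell) / len(cell) for d in range(dims))
        for cell in top
    ]
    if len(candidates) < k:
        raise ValueError("fewer than k dense cells; adjust the grid parameters")

    # Start from the densest cell, then greedily add the candidate that is
    # farthest from the centers already chosen (farthest-first traversal).
    centers = [candidates[0]]
    while len(centers) < k:
        remaining = [c for c in candidates if c not in centers]
        centers.append(
            max(remaining, key=lambda c: min(math.dist(c, s) for s in centers))
        )
    return centers
```

The returned centers would then replace the random initial centers of an ordinary K-means run. The greedy choice is a simplification: it does not guarantee the globally most spread-out set of K points.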
허명회, 손은진 Institute of Statistics, Korea University 2003 Applied Statistics (應用統計) Vol.18 No.-
The Rand index measures the agreement between two clustering results and is used in data-splitting methods to evaluate the reproducibility of a clustering (Rand, 1971); it can therefore be used to predict whether clustering patterns will be reproducible in the future. However, it applies only to clusterings in which each unit is assigned to exactly one cluster. Fuzzy K-means clustering, the subject of this study, has clear merits in dealing with overlapping clusters, but it describes each unit by its degrees of membership in the clusters, so the Rand index cannot be used in its original form. The aim of this study is to extend the Rand index, or the corrected Rand index of Hubert and Arabie (1985), for use in evaluating the agreement between fuzzy K-means clustering results. The proposed method can be summarized as follows.
Step 1: Partition the data into three parts: two training samples and one test sample. Derive a fuzzy K-means clustering rule from the first training sample and another rule from the second, independent training sample. Apply both rules separately to the test sample units to obtain, for each unit, a pair of K (= number of clusters) fuzzy memberships.
Step 2: Defuzzify. For each rule, randomly and independently assign each test unit to one of the K clusters with probability proportional to its memberships, generating a pair of hard partitions.
Step 3: Compute the Rand index, or the corrected Rand index of Hubert and Arabie (1985), from the pair of hard partitions.
Step 4: Repeat Steps 2 and 3 a sufficient number of times to obtain the Monte Carlo distribution of the Rand index. Define the extended Rand index as the mean of that distribution.
The extended Rand index can be used to determine the number of clusters K in fuzzy K-means clustering. Several examples are presented and discussed.
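Steps 2 to 4 of the procedure can be sketched as below. The sketch assumes the two fuzzy membership matrices for the test sample (one row per unit, one column per cluster) have already been produced by Step 1, uses the plain rather than the corrected Rand index, and uses hypothetical function names.

```python
import random
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of unit pairs on which two hard partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def defuzzify(memberships, rng):
    """Draw one hard label per unit with probability proportional to membership."""
    clusters = list(range(len(memberships[0])))
    return [rng.choices(clusters, weights=row)[0] for row in memberships]


def extended_rand_index(memberships_a, memberships_b, repeats, rng):
    """Monte Carlo mean of the Rand index over repeated defuzzifications."""
    total = 0.0
    for _ in range(repeats):
        hard_a = defuzzify(memberships_a, rng)
        hard_b = defuzzify(memberships_b, rng)
        total += rand_index(hard_a, hard_b)
    return total / repeats
```

Because the Rand index only compares whether pairs of units share a cluster, the two rules do not need to number their clusters in the same order.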
An Efficient Clustering Method Based on Multiple Centroid Sets Using MapReduce
Sungmin Kang, Seokjoo Lee, Jun-ki Min Korean Institute of Information Scientists and Engineers (KIISE) 2015 KIISE Transactions on Computing Practices Vol.21 No.7
As data sizes grow, analyzing big data to identify its properties has become very important. In this paper, we propose MCSK-Means (Multi centroid set k-Means), an effective clustering technique based on k-Means that uses the distributed parallel processing framework MapReduce. A problem with the k-Means algorithm is that the accuracy of the clustering result depends heavily on the positions of the k randomly chosen initial centroids. To alleviate this problem, the MCSK-Means algorithm reduces the dependency on randomly generated initial centroids by using m centroid sets, each consisting of k centroids. In addition, after the clustering phase, we apply an agglomerative hierarchical clustering algorithm to the centroids in the m centroid sets to create the final k cluster centroids. We implemented MCSK-Means on the MapReduce framework so that big data can be processed efficiently.
A Study on the Distribution of Cold Water Occurrence Using K-Means Clustering
Bum-Kyu Kim, Hong-Joo Yoon, Jun Ho Lee Korea Institute of Electronic Communication Sciences (KIECS) 2021 The Journal of the Korea Institute of Electronic Communication Sciences Vol.16 No.2
In this study, to distinguish the spatial distribution of cold water occurring in the Southeast Sea of Korea, water temperature data from the ocean observation buoys at Gori and Yangpo and GHRSST Level 4 reanalysis sea surface temperature (SST) data for 2016 to 2018 were analyzed using the K-means clustering method. The buoy data were used to identify changes in sea water temperature and cold water occurrence at Gori and Yangpo in the Southeast Sea. The sea water temperature at Gori and Yangpo decreased together when cold water occurred. To see how SST changes when cold water occurs, the reciprocal of the sea water temperature was compared with the variance of SST. When the reciprocal of the sea water temperature increased, the variance of SST also increased, which shows that the SST distribution of the sea area changes when cold water occurs. K-means clustering was then used to classify the area where cold water occurs. After finding the optimal K value with the Elbow method, the classification identified the coastal region where cold sea water exists. This makes it possible to estimate the spatial distribution and diffusion range of the cold water area, and the results are expected to be useful in future studies that identify damage caused by cold water and predict its spatial spread.
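The Elbow method used to choose K can be sketched as follows. The abstract does not say how the elbow was located, so this sketch assumes one common heuristic: after rescaling both axes to [0, 1], pick the K whose point lies farthest below the straight line joining the first and last points of the within-cluster sum of squares curve. The inertia values are assumed to be precomputed and strictly larger at the first K than at the last.

```python
def elbow_k(k_values, inertias):
    """Return the K whose point lies farthest below the first-to-last chord."""
    k_span = k_values[-1] - k_values[0]
    inertia_span = inertias[0] - inertias[-1]
    best_k, best_gap = k_values[0], 0.0
    for k, inertia in zip(k_values, inertias):
        x = (k - k_values[0]) / k_span               # 0 at the first K, 1 at the last
        y = (inertia - inertias[-1]) / inertia_span  # 1 at the first K, 0 at the last
        gap = 1.0 - x - y  # proportional to the distance below the chord x + y = 1
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

For example, with made-up inertias `[100, 40, 15, 12, 10, 9]` for K = 1 to 6, the curve flattens after the third value and `elbow_k` returns 3.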
An Efficient K-Means Algorithm and its Benchmarking against other Algorithms
Anupama Chadha, Suresh Kumar Science and Engineering Research Support Society (SERSC) 2016 International Journal of Grid and Distributed Comp Vol.9 No.11
K-Means is a widely used partition-based clustering algorithm famous for its simplicity and speed. It organizes an input dataset into a predefined number of clusters. K-Means has a major limitation: the number of clusters, K, needs to be pre-specified as an input. Pre-specifying K is sometimes difficult in the absence of thorough domain knowledge, or for a new and unknown dataset. This limitation can lead to “forced” clustering of the data, so that a proper classification does not emerge. In this paper, a new algorithm based on K-Means is developed. This algorithm has advanced features of intelligent data analysis and automatic generation of an appropriate number of clusters. The clusters generated by the new algorithm are compared against results obtained with the original K-Means and various other well-known clustering algorithms. This comparative analysis is done using sets of real data.
K-Means Clustering of Shakespeare Sonnets with Selected Features
T. Senthil Selvi, R. Parimala Science and Engineering Research Support Society (SERSC) 2016 International Journal of Database Theory and Appli Vol.9 No.8
This paper focuses on clustering the lines of Shakespeare's sonnets. Sonnet Line Clustering (SLC) is the task of grouping a set of lines so that lines in the same cluster are more similar to each other than to those in other clusters. K-Means clustering is a very effective clustering technique, well known for its observed speed and its simplicity. Its aim is to find the best division of N lines into K groups (clusters) so that the total distance between each group's members and its corresponding centroid is minimized. A new algorithm, Sonnet Line Clustering with Random Feature Selection (SLCRFS), is proposed. The process can be validated by external or internal validation. Because internal validation is considered to have no considerable impact on this research, this work concentrates on external validation measures. Entropy and purity are frequently used external validation measures for K-Means. The proposed approach uses entropy as its performance measure. The clusters formed are evaluated and interpreted according to the Euclidean distance between the data points and the center of each cluster. The paper concludes with an analysis of the results of using the proposed measure to display the sonnets clustered by the K-Means algorithm with minimum entropy for different feature sets.
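The two external validation measures named in this abstract, entropy and purity, can be computed as in the sketch below. It uses the standard textbook definitions (size-weighted class entropy per cluster, and the fraction of items belonging to their cluster's majority class); the abstract does not state which variant the authors used, and the function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def _class_counts(cluster_labels, class_labels):
    """Count, for every cluster, how many items of each true class it holds."""
    counts = defaultdict(Counter)
    for cluster, klass in zip(cluster_labels, class_labels):
        counts[cluster][klass] += 1
    return counts


def purity(cluster_labels, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    counts = _class_counts(cluster_labels, class_labels)
    return sum(max(c.values()) for c in counts.values()) / len(cluster_labels)


def entropy(cluster_labels, class_labels):
    """Size-weighted average of the class entropy (in bits) within each cluster."""
    counts = _class_counts(cluster_labels, class_labels)
    n = len(cluster_labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((v / size) * math.log2(v / size) for v in c.values())
        total += (size / n) * h
    return total
```

Lower entropy and higher purity both indicate clusters that align better with the known classes, which is why a minimum-entropy clustering is preferred.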
이영찬 Korean Data Analysis Society 2011 Journal of the Korean Data Analysis Society Vol.13 No.3
The purpose of this study is to build a clustering-based performance prediction model that predicts the financial performance of small and medium-sized enterprises that received technology guarantee support, through case-based reasoning on KIBO technology rating data. The clustering-based performance prediction model is an ex-post model that predicts future performance without a priori information such as bankruptcy/non-bankruptcy. Performance here means financial performance derived from existing financial data, and the exogenous variables for predicting financial performance (the cluster obtained by k-means clustering) are the 45 KTRS technology rating items. Specifically, k-means clustering is performed on conventional financial ratios of the rated companies (from the viewpoints of growth, profitability, activity, stability, and efficiency), and three clusters are derived: an upper group, a middle group, and a lower group. Discriminant analysis is then used to identify which of the 45 technology rating items are significant for predicting financial performance. Lastly, case-based reasoning with the k-nearest neighbor method, a data mining technique, is applied to the significant rating items to predict future financial performance (clusters). Although ex-post technology rating items were used, the analysis shows relatively good predictive power on the validation set, as measured by AUROC.
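The final step of this pipeline, predicting a company's performance cluster from its nearest past cases, can be sketched with a minimal k-nearest neighbor classifier. The Euclidean distance, the unweighted majority vote, and the function name are assumptions; the abstract does not describe the distance measure or voting scheme that was actually used.

```python
import math
from collections import Counter


def knn_predict(train_features, train_clusters, query, k):
    """Predict a cluster label by majority vote among the k nearest cases."""
    order = sorted(range(len(train_features)),
                   key=lambda i: math.dist(train_features[i], query))
    votes = Counter(train_clusters[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Here `train_features` would hold the selected technology rating items of past companies and `train_clusters` the financial performance cluster each was assigned by k-means.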