Cancer is a class of complex genetic diseases characterized by out-of-control cell growth. Cancer classification has been a crucial research topic in cancer treatment. Over the last decade, mRNA expression profiling with microarrays has been widely used to classify the different types of human cancer. However, microarray data pose a severe challenge for computational techniques, so dimension reduction techniques are needed that identify a small set of genes and thereby achieve better learning performance. From a machine learning perspective, gene selection can be treated as a feature selection problem: the aim is to find a small subset of features that carries the most discriminative information about the target.
In this thesis, we propose an Ensemble Correlation-Based Gene Selection (ECBGS) algorithm based on symmetrical uncertainty (SU) and the Support Vector Machine (SVM). In our method, symmetrical uncertainty is used to assess the relevance of the genes, different starting points within the relevant subset are used to generate the gene subsets, and the SVM serves as the evaluation criterion of the wrapper.
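To make the relevance measure concrete, the following is a minimal sketch of symmetrical uncertainty in its standard form, SU(X, Y) = 2 · IG(X; Y) / (H(X) + H(Y)), computed for one gene against the class labels. It is an illustration only, not the thesis implementation: it assumes expression values have already been discretized (the abstract does not say how), and the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """Standard SU: 2 * IG(X; Y) / (H(X) + H(Y)), a value in [0, 1].

    Assumes x (a discretized gene) and y (class labels) are
    equal-length sequences of hashable values.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both variables are constant, so there is nothing to share.
        return 0.0
    # Information gain equals mutual information: H(X) + H(Y) - H(X, Y).
    info_gain = h_x + h_y - entropy(list(zip(x, y)))
    return 2.0 * info_gain / (h_x + h_y)


# Hypothetical toy data: four samples, two classes, three discretized genes.
labels = [0, 0, 1, 1]
genes = {
    "gene_a": [0, 0, 1, 1],  # tracks the class exactly
    "gene_b": [0, 1, 0, 1],  # independent of the class
    "gene_c": [1, 1, 1, 1],  # constant, hence uninformative
}

# Rank genes by relevance to the class, most relevant first.
ranking = sorted(
    genes,
    key=lambda g: symmetrical_uncertainty(genes[g], labels),
    reverse=True,
)
```

A ranking of this kind only covers the relevance step; how ECBGS picks starting points in the relevant subset and scores candidate subsets with the SVM wrapper is not specified in this abstract and is not modeled here.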
In the experiments, we used six publicly available benchmark datasets to evaluate the performance of our method, using classifiers trained both under 10-fold cross-validation and on training sets of different sizes. The results show that the classification model with our proposed gene selection algorithm has higher prediction accuracy and that our method can still achieve high accuracy when the number of training instances is small. Compared with other methods published in the literature, our method yields good results.
ECBGS can potentially be used in miRNA expression profiling for cancer classification. Moreover, we believe that our mechanism is also applicable to other feature selection problems and can be extended to the classification of other disease states.