Abstract: Objective To compare the classification performance of commonly used classification algorithms on imbalanced data with different sample sizes and minority class proportions. Methods A Monte Carlo approach was used to generate random data sets with different sample sizes and class distributions, and commonly used classification algorithms were then applied to them to calculate the F1 value and the AUC value. Results The F1 and AUC values of all algorithms increased as the sample size and the minority class proportion increased, with the F1 value being the more sensitive of the two. Logistic regression and the neural network outperformed the other methods at small sample sizes, and the F1 value of random forest was superior to that of the other algorithms when the minority class accounted for 5% or 3% of a sample of 5 000. Conclusion Sample size and class distribution have a great influence on the F1 value. Logistic regression and neural networks may be more appropriate for small samples, while random forest performs well when the minority class proportion is very low and the sample size is large enough.
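The Monte Carlo design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the data-generating model (two unit-variance Gaussians separated by a hypothetical `shift`), the fixed decision `threshold`, the number of repetitions, and all function names are assumptions made for illustration. A fixed scoring rule stands in for the trained classifiers (logistic regression, neural network, random forest), so the sketch only isolates how the F1 value and the AUC value respond to the minority class proportion. It does not reproduce the study's results.

```python
import random


def simulate(n, minority_frac, rng, shift=1.5):
    """Draw n labelled scores: minority class ~ N(shift, 1), majority ~ N(0, 1)."""
    n_pos = max(1, round(n * minority_frac))
    labels = [1] * n_pos + [0] * (n - n_pos)
    scores = [rng.gauss(shift if y == 1 else 0.0, 1.0) for y in labels]
    return labels, scores


def f1_score(labels, preds):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc_score(labels, scores):
    """AUC from the rank-sum (Mann-Whitney) statistic, with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one sample of each class")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def monte_carlo(n, minority_frac, reps=200, threshold=0.75, seed=0):
    """Average F1 and AUC over `reps` simulated data sets of size n."""
    rng = random.Random(seed)
    f1_total = auc_total = 0.0
    for _ in range(reps):
        labels, scores = simulate(n, minority_frac, rng)
        preds = [1 if s > threshold else 0 for s in scores]
        f1_total += f1_score(labels, preds)
        auc_total += auc_score(labels, scores)
    return f1_total / reps, auc_total / reps


if __name__ == "__main__":
    for n in (100, 1000, 5000):
        for frac in (0.03, 0.05, 0.20):
            f1, auc = monte_carlo(n, frac, reps=50)
            print(f"n={n:5d} minority={frac:.0%} F1={f1:.3f} AUC={auc:.3f}")
```

Under these assumptions the average F1 value falls sharply as the minority class becomes rarer, while the average AUC of the fixed scoring rule stays roughly constant. This is one reason the F1 value can appear the more sensitive of the two measures. In the study itself the classifiers are trained on each data set, so their AUC can also vary with sample size and class distribution.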
袁联雄, 佘玲玲, 林爱华, 骆福添. 常用分类算法在不同样本量和类分布的不平衡数据中的分类效果比较[J]. 中国医院统计, 2015, 22(1): 22-26.
Yuan Lianxiong, She Lingling, Lin Aihua, Luo Futian. Classification effects of classification algorithms in imbalanced data of different sample sizes and class-distribution[J]. Chinese Journal of Hospital Statistics, 2015, 22(1): 22-26.