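Comparison of the classification performance of common classification algorithms on imbalanced data with different sample sizes and class distributions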

doi:10.3969/j.issn.1006-5253.2015.01.007


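Abstract Objective To compare the classification performance of commonly used classification algorithms on imbalanced data sets with different sample sizes and rare-class proportions. Methods Monte Carlo simulation was used to generate random samples with different sample sizes and rare-class proportions; each sample was classified with each algorithm, and the algorithms' F₁ and AUC values were compared. Results The classification performance of every algorithm improved as the sample size and the rare-class proportion increased, and the change was more pronounced for F₁. When the rare class accounted for 30% or 20% of the sample, F₁ varied by less than 0.2 and was above 0.6 in all cases (AUC > 0.83). Logistic regression and the neural network outperformed the other three algorithms at sample sizes of 150 and 500; at a sample size of 5 000 with the rare class at 5% or 3%, the F₁ of random forest was clearly higher than that of the other algorithms. Conclusion F₁ is strongly affected by sample size and class distribution. When the rare-class proportion is not too low, all algorithms still achieve acceptable classification performance; logistic regression and the neural network perform better in small samples, and random forest outperforms the other algorithms when the rare-class proportion is low and the sample is large.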


Keywords: imbalanced data set, classification algorithm, Monte Carlo

Abstract: Objective To compare the classification performance of commonly used classification algorithms on imbalanced data with different sample sizes and minority-class proportions. Methods A Monte Carlo approach was used to generate random data sets with different sample sizes and class distributions; commonly used classification algorithms were then applied to each data set, and their F₁ and AUC values were calculated. Results The F₁ and AUC values of all algorithms increased with sample size and minority-class proportion, and F₁ was the more sensitive of the two. Logistic regression and the neural network outperformed the other methods in small samples, and the F₁ of random forest was superior to that of the other algorithms when the minority class accounted for 5% or 3% of a sample of 5 000. Conclusion Sample size and class distribution have a great influence on F₁. Logistic regression and neural networks may be more appropriate for small samples, while random forest performs well when the minority-class proportion is very low and the sample is large enough.
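The simulation design described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the data model (one Gaussian score per case), the class separation `shift`, the fixed decision threshold that stands in for a fitted classifier, and the number of replicates are all assumptions made for illustration. Only the sample sizes and minority proportions in the demo loop are taken from the abstract, and the study's full grid may be larger. F₁ is computed for the minority class as 2PR/(P+R), with precision P and recall R.

```python
import random


def simulate(n, minority_prop, rng, shift=1.5):
    """Draw n (score, label) pairs; label 1 is the minority class.

    Assumed data model (not from the paper): one score per case,
    N(shift, 1) for the minority class and N(0, 1) for the majority.
    """
    data = []
    for _ in range(n):
        label = 1 if rng.random() < minority_prop else 0
        score = rng.gauss(shift if label else 0.0, 1.0)
        data.append((score, label))
    return data


def f1_at_threshold(data, threshold):
    """F1 for the minority class when score >= threshold predicts label 1."""
    tp = sum(1 for s, y in data if s >= threshold and y == 1)
    fp = sum(1 for s, y in data if s >= threshold and y == 0)
    fn = sum(1 for s, y in data if s < threshold and y == 1)
    if tp == 0:
        return 0.0  # convention: F1 is 0 when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(data):
    """Rank-based (Mann-Whitney) AUC; None if one class is absent.

    Scores are continuous here, so ties are not handled.
    """
    n_pos = sum(y for _, y in data)
    n_neg = len(data) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None
    ranked = sorted(data)  # ascending by score
    rank_sum = sum(r for r, (_, y) in enumerate(ranked, start=1) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def monte_carlo(n, minority_prop, reps=30, seed=0, threshold=0.75):
    """Average F1 and AUC over `reps` simulated data sets."""
    rng = random.Random(seed)
    f1s, aucs = [], []
    for _ in range(reps):
        data = simulate(n, minority_prop, rng)
        f1s.append(f1_at_threshold(data, threshold))
        a = auc(data)
        if a is not None:  # skip replicates that contain only one class
            aucs.append(a)
    mean_auc = sum(aucs) / len(aucs) if aucs else float("nan")
    return sum(f1s) / len(f1s), mean_auc


if __name__ == "__main__":
    # Sample sizes and minority proportions mentioned in the abstract.
    for n in (150, 500, 5000):
        for prop in (0.30, 0.20, 0.05, 0.03):
            f1, a = monte_carlo(n, prop)
            print(f"n={n:5d} minority={prop:.2f} "
                  f"mean F1={f1:.3f} mean AUC={a:.3f}")
```

With a fixed threshold, precision falls as the minority class becomes rarer, so the mean F₁ drops sharply while the mean AUC stays roughly constant. This matches the abstract's observation that F₁ is the more sensitive measure. Because no model is fitted, the sketch does not reproduce the sample-size effect or the differences between algorithms that the study reports.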

Key words: Imbalanced data; Classification algorithm; Monte Carlo

Received: 2015-02-02

Corresponding author: Luo Futian, Email: luoft@mail.sysu.edu.cn

Cite this article:

Yuan Lianxiong, She Lingling, Lin Aihua, Luo Futian. Classification effects of classification algorithms in imbalanced data of different sample sizes and class-distribution[J]. Chinese Journal of Hospital Statistics (中国医院统计), 2015, 22(1): 22-26.

Link to this article:

http://manu58.magtech.com.cn/Jwk_zgyytj/CN/10.3969/j.issn.1006-5253.2015.01.007 or http://manu58.magtech.com.cn/Jwk_zgyytj/CN/Y2015/V22/I1/22