The Great 179-Classifier Evaluation

300 bags of chips: all eaten!
179 classifiers: all tested!

Source

In a 2014 paper titled

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

(cited 823 times as of this writing), researchers evaluated the performance of 179 classifiers from 17 families on 121 data sets!

On to the results

Classifier ranking

First place: Random Forest (RF)

Implemented with the caret package in R. The paper reports that RF achieves 94.1% of the maximum accuracy on average (i.e., of the best accuracy reached by any classifier on each data set) and exceeds 90% of that maximum on 84.3% of the data sets.
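To unpack that metric: for each data set, take the best accuracy achieved by any of the 179 classifiers as the "maximum accuracy" of that data set; a classifier is then scored by what fraction of that maximum it reaches. Below is a minimal Python sketch of how such a score could be computed, with a made-up toy accuracy matrix standing in for the paper's 179 x 121 results table.

```python
import numpy as np

# Toy accuracy matrix: rows = classifiers, columns = data sets.
# (Made-up numbers purely to illustrate the metric; the paper's matrix
# is 179 classifiers x 121 data sets.)
acc = np.array([
    [0.95, 0.80, 0.99],   # e.g. a random-forest-like classifier
    [0.93, 0.85, 0.90],   # e.g. an SVM-like classifier
    [0.70, 0.60, 0.88],   # a weaker baseline
])

# Best accuracy achieved by ANY classifier on each data set.
max_per_dataset = acc.max(axis=0)

# Each classifier's accuracy as a fraction of that per-data-set maximum.
pct_of_max = acc / max_per_dataset

# "Achieves X% of the maximum accuracy" = mean over data sets.
mean_pct = pct_of_max.mean(axis=1)

# "Exceeds 90% of the maximum on Y% of the data sets."
frac_above_90 = (pct_of_max > 0.90).mean(axis=1)

for i, (m, f) in enumerate(zip(mean_pct, frac_above_90)):
    print(f"classifier {i}: {m:.1%} of max accuracy, above 90% on {f:.1%} of data sets")
```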

Suddenly the "random forest" breeze wafting through the lab feels rather refreshing...

Although it comes first on the numbers, the gap to second place is not statistically significant (in other words, second place performs about as well as first).
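For intuition on what "not significantly different" means here: a common way to compare two classifiers across many data sets is a paired test on their per-data-set accuracies. The sketch below uses a Wilcoxon signed-rank test on hypothetical numbers; it only illustrates the idea and is not the paper's exact statistical procedure.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical per-data-set accuracies for two classifiers on 121 data sets.
acc_rf  = rng.uniform(0.7, 1.0, size=121)
acc_svm = np.clip(acc_rf - rng.normal(0.005, 0.02, size=121), 0.0, 1.0)

# Paired test over data sets: is the difference in accuracy significant?
stat, p_value = wilcoxon(acc_rf, acc_svm)
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.3f}")
# A large p-value means we cannot conclude that one classifier is better.
```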

Second place: SVM with Gaussian kernel

Implemented with the LibSVM library in C; it achieves 92.3% of the maximum accuracy.
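As a point of reference, scikit-learn's `SVC` wraps the same LibSVM library, so the paper's top two are easy to try side by side in Python. The snippet below is a minimal sketch on a small built-in data set, not the paper's R/caret and C setups, and with untuned hyperparameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Random forest: the paper's overall winner (there via R's caret).
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
    # Gaussian-kernel SVM: the runner-up; SVC uses LibSVM under the hood.
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale")),
}

for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```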

Several other models also perform well and are significantly better than the remaining classifiers, including:

  1. SVM with polynomial kernels
  2. extreme learning machine with Gaussian kernel
  3. C5.0
  4. avNNet (a committee of multi-layer perceptrons implemented in R with the caret package; a rough Python analogue is sketched after this list)
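On the avNNet entry: in caret it averages the outputs of several small MLPs trained from different random seeds. A rough, purely illustrative Python analogue using scikit-learn's `MLPClassifier` and soft voting (not the caret implementation) might look like this:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# avNNet-style committee: 5 MLPs with different random seeds,
# with their predicted probabilities averaged via soft voting.
committee = VotingClassifier(
    estimators=[
        (f"mlp{seed}", MLPClassifier(hidden_layer_sizes=(32,),
                                     max_iter=500, random_state=seed))
        for seed in range(5)
    ],
    voting="soft",
)

clf = make_pipeline(StandardScaler(), committee)
print(cross_val_score(clf, X, y, cv=3).mean())
```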

Family ranking

1st: the Random Forest family, with 3 of the top 5 classifiers

2nd: the SVM family, with 4 of the top 10

3rd: the neural network family, with 5 of the top 20

4th: the Boosting family, with 3 of the top 20

[Figure: Friedman rank distributions of the classifiers, grouped by family]

In the figure, the Friedman rank is used to assess the performance of the algorithms in each family (lower is better): the upper panel shows the distribution of ranks within each family, and the lower panel shows the best (minimum) rank in each family.
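Concretely, a classifier's Friedman rank is its average rank over all data sets, where within each data set the classifiers are ranked by accuracy (rank 1 = best). A small sketch with a toy accuracy matrix:

```python
import numpy as np
from scipy.stats import rankdata

# Toy accuracy matrix: rows = classifiers, columns = data sets.
acc = np.array([
    [0.95, 0.80, 0.99],
    [0.93, 0.85, 0.90],
    [0.70, 0.60, 0.88],
])

# Within each data set (column), rank classifiers by accuracy.
# rankdata ranks ascending, so negate accuracies to make rank 1 the best.
ranks = np.apply_along_axis(rankdata, 0, -acc)

# Friedman rank = mean rank across data sets (lower is better).
friedman_rank = ranks.mean(axis=1)
print(friedman_rank)   # e.g. [1.33, 1.67, 3.0]
```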

Limitations

You might be wondering: if RF is that good, why is deep learning what's hot these days? Surely deep learning already existed in 2014?

In fact, although the benchmark covers 121 data sets, they are all drawn from the UCI repository (huh? never heard of UCI? see the next section!) and none of them are large-scale, so deep learning's advantages naturally don't show up.

Compared with RF, SVM, and the like, deep learning's main advantage is that its performance keeps climbing as the amount of data grows. Borrowing a slide from week 1 of the first course of Andrew Ng's DeepLearning.ai series:

[Figure: model performance vs. amount of data, from the DeepLearning.ai slides]

As data volume grows, traditional algorithms are capped by the capacity of the model itself, whereas a neural network can keep growing its architecture and keep improving its fit to the data.

At the same time, UCI data sets are mostly used by researchers as stand-ins for real tasks, and they differ in important ways from the real-world problems studied today, so don't over-generalize from this one benchmark.

Weigh it critically ●0●

UCI Data Sets

The UCI repository is a collection of machine learning data sets maintained by the University of California, Irvine. As of this writing it holds 426 data sets and is one of the most commonly used standard benchmarks.
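Most UCI data sets are served as plain CSV-like files, which is part of why the repository is so convenient as a benchmark. As a sketch, the classic Iris data set can be loaded directly with pandas (this assumes the long-standing download URL still resolves; check the current UCI site otherwise):

```python
import pandas as pd

# Classic UCI Iris data set, served as a headerless CSV.
# (Assumption: the long-standing URL below still resolves.)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

iris = pd.read_csv(url, header=None, names=columns)
print(iris.shape)                    # (150, 5)
print(iris["class"].value_counts())  # 50 samples per species
```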


From the official About page:

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited “papers” in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged.

Appendix: the classifiers evaluated

See Section 2.2 (Classifiers) of the paper for full details; here we only list the classifier families and their representative algorithms. Most of the remaining algorithms are minor variants of some classic method.

Note that many of these algorithms are quite old (dating back to the last century), so choose with care when deciding what to use.

Discriminant analysis (DA): 20 classifiers

  1. LDA, linear discriminant analysis
  2. SDA, shrinkage discriminant analysis
  3. QDA, quadratic discriminant analysis
  4. FDA, flexible discriminant analysis
  5. MDA, mixture discriminant analysis
  6. PDA, penalized discriminant analysis
  7. RDA, regularized discriminant analysis
  8. HDDA, high-dimensional discriminant analysis

Bayesian (BY) approaches: 6 classifiers

  1. NaiveBayes
  2. BayesNet

Neural networks (NNET): 21 classifiers

  1. RBF, radial basis functions neural network
  2. MLP, multi-layer perceptron
  3. avNNet, creates a committee of 5 MLPs
  4. PNN, probabilistic neural network

(Indeed, deep neural networks were not included in the comparison.)

Support vector machines (SVM): 10 classifiers

  1. SVM, support vector machine
  2. various kernels and variants (kernels will be a topic at a future group sharing session)

Decision trees (DT): 14 classifiers

  1. rpart
  2. C5.0
  3. J48
  4. RandomTree
  5. DecisionStump, one-node decision tree

Rule-based methods (RL): 12 classifiers

  1. C5.0Rules, uses the same C5.0 function (in the C50 package) as the C5.0Tree classifier, but creates a collection of rules instead of a classification tree.

Boosting (BST): 20 classifiers

  1. adaboost
  2. logitboost
  3. AdaBoostM1
  4. MultiBoostAB

Bagging (BAG): 24 classifiers

  1. bagging
  2. treebag
  3. ldaBag
  4. nbBag
  5. svmBag
  6. nnetBag

Random Forests (RF): 8 classifiers

  1. random forest

Other ensembles (OEN): 11 classifiers

Generalized Linear Models (GLM): 5 classifiers

Nearest neighbor methods (NN): 5 classifiers

  1. KNN

Partial least squares and principal component regression (PLSR): 6 classifiers

Logistic and multinomial regression (LMR): 3 classifiers

  1. Logistic

Multivariate adaptive regression splines (MARS): 2 classifiers

Other Methods (OM): 10 classifiers
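For a sense of how these families map onto modern tooling, here is a non-exhaustive sketch pairing one representative per family with a scikit-learn estimator and running them through the same fit/evaluate loop. These are not the paper's implementations (which are mostly R, Weka, C/C++, or Matlab), and the choice of representatives is ours.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis   # DA
from sklearn.ensemble import (AdaBoostClassifier,                      # BST
                              BaggingClassifier,                       # BAG
                              RandomForestClassifier)                  # RF
from sklearn.linear_model import LogisticRegression                    # LMR
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB                             # BY
from sklearn.neighbors import KNeighborsClassifier                     # NN
from sklearn.neural_network import MLPClassifier                       # NNET
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC                                            # SVM
from sklearn.tree import DecisionTreeClassifier                        # DT

# One hand-picked representative per family (where scikit-learn has one).
representatives = {
    "DA":   LinearDiscriminantAnalysis(),
    "BY":   GaussianNB(),
    "NNET": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    "SVM":  SVC(kernel="rbf"),
    "DT":   DecisionTreeClassifier(random_state=0),
    "BST":  AdaBoostClassifier(random_state=0),
    "BAG":  BaggingClassifier(random_state=0),
    "RF":   RandomForestClassifier(random_state=0),
    "NN":   KNeighborsClassifier(),
    "LMR":  LogisticRegression(max_iter=1000),
}

# Every estimator shares the same fit/predict interface, so looping an
# evaluation over families (as the paper does at much larger scale) is easy.
X, y = load_wine(return_X_y=True)
for family, clf in representatives.items():
    pipe = make_pipeline(StandardScaler(), clf)
    score = cross_val_score(pipe, X, y, cv=3).mean()
    print(f"{family:4s} {score:.3f}")
```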

That's all.


Wait, what? There's a bonus??

In the 300-bag chip taste test, the tastiest chips were:

Calbee IMO&MAME, green pea / kombu flavor!!!


Though rumor has it that Calbee's 薯条三兄弟 ("potato stick three brothers") are the truly delicious ones?


Stop scrolling... there really isn't any more...

( ̄ε(# ̄)