决策树之五:连续变量计算过程
下面举例说明如何划分,给定数据集如下(数据集来自周志华《机器学习》) 可复制数据集在如下评论第一条! |
对连续属性的处理如下: | |||||||||
1. 对特征的取值进行升序排序 | |||||||||
因此对于数据集中的属性“密度”,决策树开始学习时,根节点包含的17个训练样本在该属性上取值均不同。我们先把“密度”这些值从小到大排序: |
0.243 | 0.245 | 0.343 | 0.36 | 0.403 | 0.437 | 0.481 | 0.556 | 0.593 | 0.608 | 0.634 | 0.639 | 0.657 | 0.666 | 0.697 | 0.719 | 0.774 |
|
(0.243+0.245)/2=0.244 | (0.245+0.343)/2=0.294 | (0.343+0.36)/2=0.352 | (0.36+0.403)/2=0.382 | ||||||||
(0.403+0.437)/2=0.420 | (0.437+0.481)/2=0.459 | (0.481+0.556)/2=0.519 | (0.556+0.593)/2=0.575 | ||||||||
(0.593+0.608)/2=0.601 | (0.608+0.634)/2=0.621 | (0.634+0.639)/2=0.637 | (0.639+0.657)/2=0.648 | ||||||||
(0.657+0.666)/2=0.662 | (0.666+0.697)/2=0.682 | (0.697+0.719)/2=0.708 | (0.719+0.774)/2=0.747 |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
2.Gain(D,a,t)是样本集D基于划分点t二分后的信息增益。划分的时候,选择使用Gain(D,a,t)最大的划分点 | |||||||||
Ent(D)=-(8/17*LOG(8/17,2)+9/17*LOG(9/17,2))=0.998 | |||||||||
记录为{a1,a2,…,an}.基于划分点t可将D分为子集D-t和D+t,其中D-t是包含那些在属性a上取值不大于t的样本,D+t则是包含那些在属性a上取值大于t的样本。显然,对相邻的属性取值ai与ai+1来说,t在区间a(ai,ai+1)中取任意值所产生的划分结果相同。因此,对连续属性a,我们可考察包含n-1个元素的候选划分点集合 |
即把区间(ai,ai+1)的中位点ai+ai+1作为候选划分点。然后我们就可以像前面处理离散属性值那样来考虑这些划分点,选择最优的划分点进行样本集合的划分,使用公式如下: |
其中Gain(D,a,t)是样本集D基于划分点t二分后的信息增益。划分的时候,选择使用Gain(D,a,t)最大的划分点。 |
下面开始计算t取不同值时的信息增益: | |
当t=0.244时 |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
D-t={0.243} | |||||||||
D+t={0.245,0.343,0.360,0.403,0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774} |
Ent(D-t)=-(0*LOG(0,2)+1*LOG(1,2))=0 | 好瓜:是=0,好瓜:否=1 | ||||||
Ent(D+t)=-(8/16*LOG(8/16,2)+8/16*LOG(8/16,2))=1 | 好瓜:是=8,好瓜:否=8 |
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((1/17)*0+(16/17)*1)=0.057 |
当t=0.294时 |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
D-t={0.243,0.245} | |||||||||
D+t={0.343,0.360,0.403,0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774} |
Ent(D-t)=-(0*LOG(0,2)+2/2*LOG(2/2,2))=0 | 好瓜:是=0,好瓜:否=2 | ||||||
Ent(D+t)=-(8/15*LOG(8/15,2)+7/15*LOG(7/15,2))=0.997 | 好瓜:是=8,好瓜:否=7 |
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((2/17)*0+(15/17)*0.997)=0.118 |
当t=0.352时 |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
D-t={0.243,0.245,0.343} | |||||||||
D+t={0.360,0.403,0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774} |
Ent(D-t)=-(0*LOG(0,2)+3/3*LOG(3/3,2))=0 | 好瓜:是=0,好瓜:否=3 | ||||||
Ent(D+t)=-(8/14*LOG(8/14,2)+6/14*LOG(6/14,2))=0.985 | 好瓜:是=8,好瓜:否=6 |
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((3/17)*0+(14/17)*0.985)=0.187 |
当t=0.382时 |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
D-t={0.243,0.245,0.343,0.360} | |||||||||
D+t={0.403,0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774} |
Ent(D-t)=-(0*LOG(0,2)+4/4*LOG(4/4,2))=0 | 好瓜:是=0,好瓜:否=4 | ||||||
Ent(D+t)=-(8/13*LOG(8/13,2)+5/13*LOG(5/13,2))=0.961 | 好瓜:是=8,好瓜:否=5 |
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((4/17)*0+(13/17)*0.961)=0.263 |
当t=0.420时 |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
D-t={0.243,0.245,0.343,0.360,0.403} | |||||||||
D+t={0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774} |
Ent(D-t)=-(1/5*LOG(1/5,2)+4/5*LOG(4/5,2))=0.722 | 好瓜:是=1,好瓜:否=4 | ||||||
Ent(D+t)=-(8/12*LOG(8/12,2)+5/12*LOG(5/13,2))=0.980 | 好瓜:是=7,好瓜:否=5 |
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((5/17)*0.722+(12/17)*0.980)=0.094 |
当t=0.459时 |
D-t={0.243,0.245,0.343,0.360,0.403,0.437} | |||||||||
D+t={0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774} |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
Ent(D-t)=-(2/6*LOG(2/6,2)+4/6*LOG(4/6,2))=0.918 | 好瓜:是=2,好瓜:否=4 | ||||||
Ent(D+t)=-(6/11*LOG(6/11,2)+5/11*LOG(5/11,2))=0.994 | 好瓜:是=6,好瓜:否=5 |
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((6/17)*0.918+(11/17)*0.994)=0.031 |
当t=0.519时 |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
D-t={0.243,0.245,0.343,0.360,0.403,0.437,0.481} | |||||||||
D+t={0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774} |
Ent(D-t)=-(3/7*LOG(3/7,2)+4/7*LOG(4/7,2))=0.985 | 好瓜:是=3,好瓜:否=4 | ||||||
Ent(D+t)=-(5/10*LOG(5/10,2)+5/10*LOG(5/10,2))=1 | 好瓜:是=5,好瓜:否=5 |
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((7/17)*0.985+(10/17)*1)=0.004 |
当t=0.575时 |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
D-t={0.243,0.245,0.343,0.360,0.403,0.437,0.481,0.556} | ||||||||
D+t={0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774} |
Ent(D-t)==-(4/8*LOG(4/8,2)+4/8*LOG(4/8,2))=1.000 | 好瓜:是=4,好瓜:否=4 | ||||||
Ent(D+t)=-(4/9*LOG(4/9,2)+5/9*LOG(5/9,2))=1 | 好瓜:是=4,好瓜:否=5 |
.˙.Gain(D,a,t)=Gain(D,密度,0.243)==0.998-((8/17)*1+(9/17)*0.991)=0.003 |
当t=0.601时 |
0.244 | 0.294 | 0.352 | 0.382 | 0.420 | 0.459 | 0.519 | 0.575 | 0.601 | 0.621 | 0.637 | 0.648 | 0.662 | 0.682 | 0.708 | 0.747 |
D-t={0.243,0.245,0.343,0.360,0.403,0.437,0.481,0.556,0.593} | ||||||||
D+t={0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774} |
Ent(D-t)==-(4/9*LOG(4/9,2)+5/9*LOG(5/9,2))=0.991 | 好瓜:是=4,好瓜:否=5 | ||||||
Ent(D+t)=-(4/8*LOG(4/8,2)+4/8*LOG(4/8,2))=1 | 好瓜:是=4,好瓜:否=4 |
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((9/17)*0.991+(8/17)*1)=0.003 |
比较能够发现,当t=0.381时,Gain(D,a,t)最大为0.263.因此选择该划分点。 |