决策树之五:连续变量计算过程

下面举例说明如何划分,给定数据集如下(数据集来自周志华《机器学习》)

可复制数据集在如下评论第一条!

决策树之五:连续变量计算过程

对连续属性的处理如下:

1.      对特征的取值进行升序排序
给定训练集D和连续属性a,假定a在D上出现了n个不同的取值,先把这些值从小到大排序

因此对于数据集中的属性“密度”,决策树开始学习时,根节点包含的17个训练样本在该属性上取值均不同。我们先把“密度”这些值从小到大排序:
0.243 0.245 0.343 0.36 0.403 0.437 0.481 0.556 0.593 0.608 0.634 0.639 0.657 0.666 0.697 0.719 0.774
决策树之五:连续变量计算过程根据计算Ta的公式,可得如下结果值:
(0.243+0.245)/2=0.244 (0.245+0.343)/2=0.294 (0.343+0.36)/2=0.352 (0.36+0.403)/2=0.382
(0.403+0.437)/2=0.420 (0.437+0.481)/2=0.459 (0.481+0.556)/2=0.519 (0.556+0.593)/2=0.575
(0.593+0.608)/2=0.601 (0.608+0.634)/2=0.621 (0.634+0.639)/2=0.637 (0.639+0.657)/2=0.648
(0.657+0.666)/2=0.662 (0.666+0.697)/2=0.682 (0.697+0.719)/2=0.708 (0.719+0.774)/2=0.747
0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 

2.Gain(D,a,t)是样本集D基于划分点t二分后的信息增益。划分的时候,选择使用Gain(D,a,t)最大的划分点

Ent(D)=-(8/17*LOG(8/17,2)+9/17*LOG(9/17,2))=0.998  
记录为{a1,a2,…,an}.基于划分点t可将D分为子集D-t和D+t,其中D-t是包含那些在属性a上取值不大于t的样本,D+t则是包含那些在属性a上取值大于t的样本。显然,对相邻的属性取值ai与ai+1来说,t在区间a(ai,ai+1)中取任意值所产生的划分结果相同。因此,对连续属性a,我们可考察包含n-1个元素的候选划分点集合决策树之五:连续变量计算过程

即把区间(ai,ai+1)的中位点ai+ai+1作为候选划分点。然后我们就可以像前面处理离散属性值那样来考虑这些划分点,选择最优的划分点进行样本集合的划分,使用公式如下:

决策树之五:连续变量计算过程

其中Gain(D,a,t)是样本集D基于划分点t二分后的信息增益。划分的时候,选择使用Gain(D,a,t)最大的划分点。

下面开始计算t取不同值时的信息增益:

当t=0.244时

0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 
D-t={0.243}
D+t={0.245,0.343,0.360,0.403,0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774}
Ent(D-t)=-(0*LOG(0,2)+1*LOG(1,2))=0 好瓜:是=0,好瓜:否=1
Ent(D+t)=-(8/16*LOG(8/16,2)+8/16*LOG(8/16,2))=1 好瓜:是=8,好瓜:否=8
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((1/17)*0+(16/17)*1)=0.057
决策树之五:连续变量计算过程

当t=0.294

0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 
D-t={0.243,0.245}
D+t={0.343,0.360,0.403,0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774}
Ent(D-t)=-(0*LOG(0,2)+2/2*LOG(2/2,2))=0 好瓜:是=0,好瓜:否=2
Ent(D+t)=-(8/15*LOG(8/15,2)+7/15*LOG(7/15,2))=0.997 好瓜:是=8,好瓜:否=7
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((2/17)*0+(15/17)*0.997)=0.118
决策树之五:连续变量计算过程

当t=0.352时

0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 
D-t={0.243,0.245,0.343}
D+t={0.360,0.403,0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774}
Ent(D-t)=-(0*LOG(0,2)+3/3*LOG(3/3,2))=0 好瓜:是=0,好瓜:否=3
Ent(D+t)=-(8/14*LOG(8/14,2)+6/14*LOG(6/14,2))=0.985 好瓜:是=8,好瓜:否=6
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((3/17)*0+(14/17)*0.985)=0.187
决策树之五:连续变量计算过程

当t=0.382时

0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 
D-t={0.243,0.245,0.343,0.360}
D+t={0.403,0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774}
Ent(D-t)=-(0*LOG(0,2)+4/4*LOG(4/4,2))=0 好瓜:是=0,好瓜:否=4
Ent(D+t)=-(8/13*LOG(8/13,2)+5/13*LOG(5/13,2))=0.961 好瓜:是=8,好瓜:否=5
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((4/17)*0+(13/17)*0.961)=0.263

决策树之五:连续变量计算过程

当t=0.420时

0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 
D-t={0.243,0.245,0.343,0.360,0.403}
D+t={0.437,0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774}
Ent(D-t)=-(1/5*LOG(1/5,2)+4/5*LOG(4/5,2))=0.722 好瓜:是=1,好瓜:否=4
Ent(D+t)=-(8/12*LOG(8/12,2)+5/12*LOG(5/13,2))=0.980 好瓜:是=7,好瓜:否=5
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((5/17)*0.722+(12/17)*0.980)=0.094

决策树之五:连续变量计算过程

当t=0.459时

D-t={0.243,0.245,0.343,0.360,0.403,0.437}
D+t={0.481,0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774}
0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 

Ent(D-t)=-(2/6*LOG(2/6,2)+4/6*LOG(4/6,2))=0.918 好瓜:是=2,好瓜:否=4
Ent(D+t)=-(6/11*LOG(6/11,2)+5/11*LOG(5/11,2))=0.994 好瓜:是=6,好瓜:否=5
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((6/17)*0.918+(11/17)*0.994)=0.031
决策树之五:连续变量计算过程

当t=0.519时

0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 
D-t={0.243,0.245,0.343,0.360,0.403,0.437,0.481}
D+t={0.556,0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774}
Ent(D-t)=-(3/7*LOG(3/7,2)+4/7*LOG(4/7,2))=0.985 好瓜:是=3,好瓜:否=4
Ent(D+t)=-(5/10*LOG(5/10,2)+5/10*LOG(5/10,2))=1 好瓜:是=5,好瓜:否=5
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((7/17)*0.985+(10/17)*1)=0.004
决策树之五:连续变量计算过程

当t=0.575时

0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 
D-t={0.243,0.245,0.343,0.360,0.403,0.437,0.481,0.556
D+t={0.593,0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774}
Ent(D-t)==-(4/8*LOG(4/8,2)+4/8*LOG(4/8,2))=1.000 好瓜:是=4,好瓜:否=4
Ent(D+t)=-(4/9*LOG(4/9,2)+5/9*LOG(5/9,2))=1 好瓜:是=4,好瓜:否=5
.˙.Gain(D,a,t)=Gain(D,密度,0.243)==0.998-((8/17)*1+(9/17)*0.991)=0.003

决策树之五:连续变量计算过程

当t=0.601时

0.244 0.294 0.352 0.382 0.420 0.459 0.519 0.575 0.601 0.621 0.637 0.648 0.662 0.682 0.708 0.747 
D-t={0.243,0.245,0.343,0.360,0.403,0.437,0.481,0.556,0.593
D+t={0.608,0.634,0.639,0.657,0.666,0.697,0.719,0.774}

Ent(D-t)==-(4/9*LOG(4/9,2)+5/9*LOG(5/9,2))=0.991 好瓜:是=4,好瓜:否=5
Ent(D+t)=-(4/8*LOG(4/8,2)+4/8*LOG(4/8,2))=1 好瓜:是=4,好瓜:否=4
.˙.Gain(D,a,t)=Gain(D,密度,0.243)=0.998-((9/17)*0.991+(8/17)*1)=0.003
决策树之五:连续变量计算过程

比较能够发现,当t=0.381时,Gain(D,a,t)最大为0.263.因此选择该划分点。