Decision Tree (决策树)

Introduction
A decision tree is an inductive classification algorithm. It is supervised, non-parametric, greedy, and built top-down by recursion.

Structure of a decision tree
- Root node: where classification starts.
- Non-leaf nodes: each represents a test condition, i.e. a test on a data attribute.
- Branches: each represents an outcome of the test.
- Leaf nodes: each carries the class label obtained after classification.

Classification with a decision tree
1. Training phase: construct a decision tree from the given training data set DB.
2. Classification phase: starting from the root, follow the tree's test attributes downward level by level until a leaf node is reached, which yields the result (the decision, i.e. the class label).

Example of a decision tree
(Figure in the original slides: a tree built from the Tid / Refund / Marital Status / Taxable Income / Cheat data set shown below. Refund is tested at the root: Yes leads to a NO leaf; No leads to a test on MarSt, where Married leads to a NO leaf and Single or Divorced leads to a test on TaxInc, with < 80K giving NO and >= 80K giving YES. A second example tree over the same data is also shown.)

Apply model to test data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree. Refund = No leads to the MarSt test, and MarSt = Married leads directly to a leaf: assign Cheat to "No".

How a decision tree is built
Basic algorithm (greedy):
- top-down, divide and conquer;
- discrete attribute values;
- recursive;
- guided by heuristic rules.
Conditions for stopping the splitting:
- all samples at the node belong to the same class, or
- there are no attributes left to split on.

Pseudocode (C4.5 style):
1. Check for the above base cases.
2. For each attribute a, find the normalized information gain ratio from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recur on the sublists obtained by splitting on a_best, and add those nodes as children of node.

Example: tracing the algorithm

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

1. samples = {1,2,3,4,5,6,7,8,9,10}, attribute_list = {Refund, MarSt, TaxInc}.
Suppose Refund is chosen as the best splitting attribute:
2. Refund = Yes: samples = {1,4,7}, attribute_list = {MarSt, TaxInc}.
3. Refund = No: samples = {2,3,5,6,8,9,10}, attribute_list = {MarSt, TaxInc}.

In step 2, all samples belong to the same class (Cheat = No), so that node becomes a NO leaf.

In step 3, suppose MarSt is chosen as the best splitting attribute:
4. MarSt = Single: samples = {3,8,10}, attribute_list = {TaxInc}.
5. MarSt = Divorced: samples = {5}, attribute_list = {TaxInc}.
6. MarSt = Married: samples = {2,6,9}, attribute_list = {TaxInc}.

In step 4, TaxInc (the only remaining attribute) is chosen as the splitting attribute: TaxInc < 80K gives a NO leaf, and TaxInc >= 80K gives a YES leaf. The nodes of steps 5 and 6 are already pure (Cheat = Yes and Cheat = No respectively), so they become leaves as well.

Information theory
C. Shannon, the father of information theory, proved that entropy is equivalent to the degree of uncertainty of the information content. His information theory is one of the three major theories of the systems science field.

Information entropy
Entropy describes the state of a system: the degree to which that state is likely to occur. If a system contains events E1, E2, ..., En occurring with probabilities p1, p2, ..., pn, then the average information content of the system, in bits, is

    entropy(p1, p2, ..., pn) = -p1·log2(p1) - p2·log2(p2) - ... - pn·log2(pn)

Entropy measures the degree of disorder of the system: the more disordered and chaotic the system, the larger its entropy.

When constructing a decision tree, entropy is defined as the measure of disorder. We choose an attribute to split the data so that, at each child node, the class values (e.g. "yes" or "no" in the example) are mostly the same (low disorder). If the data at a node is distributed uniformly over the possible class values, the node's entropy (disorder) is maximal; if the data at a node all have the same class value, its entropy is minimal. Splitting aims at child nodes that are as pure as possible, which amounts to lowering the entropy of the system.

Example: the weather data set
All attributes are nominal. Which factors influence whether we go and play tennis?

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No
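Before walking through the numbers, here is a minimal Python sketch of the information-value formula defined above. The function names info and split_info are our own, not from the original slides; the printed values reproduce the calculation that follows.

```python
import math

def info(counts):
    # Information value (entropy in bits) of a class distribution,
    # e.g. info([9, 5]) for a node with 9 "yes" and 5 "no" instances.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def split_info(partitions):
    # Weighted average information after a split, where each partition
    # is the class-count list of one child node.
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

print(f"{info([9, 5]):.3f}")                          # 0.940 bits at the root
print(f"{split_info([[2, 3], [4, 0], [3, 2]]):.3f}")  # 0.694 bits after the outlook
                                                      # split (the text rounds to 0.693)
print(f"{info([9, 5]) - split_info([[2, 3], [4, 0], [3, 2]]):.3f}")  # 0.247 bits of gain
```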
Four candidate splits are considered: 1. by outlook; 2. by temperature; 3. by humidity; 4. by windy.

Constructing the tree:
- compute the information value of the training samples;
- for the first tree on each attribute, compute the information value at each leaf node;
- compute the information gain of each such tree;
- split on the attribute with the largest information gain;
- recurse and continue splitting in the same way;
- terminate when all leaf nodes are pure.

(1) Information value of the training samples (based on class proportions)
The training samples (the data set used to create the tree) contain 9 yes and 5 no instances at the root node, corresponding to an information value of info([9,5]) = 0.940 bits, the total information.

(2) First tree, by attribute: information values at the leaf nodes
For the split on outlook, the counts of the yes and no classes at the leaf nodes are [2,3], [4,0] and [3,2], and the information values of these nodes are:
info([2,3]) = 0.971 bits -> sunny
info([4,0]) = 0.0 bits   -> overcast
info([3,2]) = 0.971 bits -> rain

          Yes  No  Total
sunny      2    3    5
overcast   4    0    4
rain       3    2    5
Total      9    5   14

(3) First tree, by attribute: information gain
Compute the weighted average information value. The information gain of the tree based on outlook is the original information requirement (based on class proportions) minus the information requirement after splitting on outlook:
gain(outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits

(4) Compute the information gain of each candidate tree
For every attribute:
gain(outlook)     = 0.247 bits
gain(temperature) = 0.029 bits
gain(humidity)    = 0.152 bits
gain(windy)       = 0.048 bits

(5) Split on the attribute with the largest information gain
The largest information gain is gain(outlook) = 0.247 bits, so outlook is chosen as the splitting attribute at the root of the tree. One of its child nodes (overcast) is pure, which makes outlook clearly better than the other attributes. Humidity is the second-best choice: it produces a fairly large child node that is almost completely pure.

(6) Recurse and continue splitting in the same way
At the node reached when outlook = sunny, the information gains of the remaining attributes for a deeper split are:
gain(temperature) = 0.571 bits
gain(humidity)    = 0.971 bits (pure child nodes)
gain(windy)       = 0.020 bits
so humidity is chosen as the next splitting attribute, and the sunny branch gets pure child nodes.
At the node reached when outlook = rain:
gain(temperature) = 0.020 bits
gain(humidity)    = 0.020 bits
gain(windy)       = 0.971 bits (pure child nodes)
so windy is chosen, and the rain branch gets pure child nodes.

(7) Terminate when all leaf nodes are pure
Ideally the process terminates when all leaf nodes are pure, i.e. when each leaf contains only instances of the same class. This result may be unattainable, because nothing prevents the training set from containing two instances that have identical attribute values but different classes.
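The step-by-step procedure above can be tied together in a compact ID3-style sketch. This is a minimal illustration under our own naming, not the exact code behind the slides: it reuses the info and split_info helpers from the earlier snippet and handles only nominal attributes (so no numeric thresholds like TaxInc < 80K).

```python
from collections import Counter

def id3(rows, attributes, target):
    # rows: list of dicts mapping attribute name -> value.
    # Requires info() and split_info() from the earlier sketch.
    classes = [r[target] for r in rows]
    # Stopping condition 1: all samples belong to the same class.
    if len(set(classes)) == 1:
        return classes[0]
    # Stopping condition 2: no attributes left -> majority class.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]

    def gain(a):
        # Group the class labels by the value of attribute a,
        # then compare information before and after the split.
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r[target])
        parts = [list(Counter(g).values()) for g in groups.values()]
        return info(list(Counter(classes).values())) - split_info(parts)

    # Greedy choice: split on the attribute with the largest gain.
    best = max(attributes, key=gain)
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, rest, target)
    return tree
```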
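Running the sketch on the 14-row weather table reproduces the tree derived above (the attribute names and table encoding below are ours):

```python
header = ["Outlook", "Temperature", "Humidity", "Windy", "Play"]
table = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rain mild high false yes
rain cool normal false yes
rain cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rain mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rain mild high true no"""
rows = [dict(zip(header, line.split())) for line in table.splitlines()]

tree = id3(rows, header[:-1], target="Play")
print(tree)
# Outlook ends up at the root, with Humidity tested under sunny and
# Windy tested under rain, matching the gains computed in steps (4)-(6).
```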
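Finally, classifying a new record is exactly the downward walk described in the classification phase at the start: follow the branch matching each tested attribute until a leaf is reached. A small helper, again our own sketch, operating on the nested-dict tree built above:

```python
def classify(tree, record):
    # Walk from the root, following the branch whose value matches
    # the record, until a leaf (a class label string) is reached.
    while isinstance(tree, dict):
        attribute = next(iter(tree))  # the attribute tested at this node
        tree = tree[attribute][record[attribute]]
    return tree

print(classify(tree, {"Outlook": "sunny", "Temperature": "cool",
                      "Humidity": "high", "Windy": "true"}))  # -> no
```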