A General Measure of Rule Interestingness

Szymon Jaroszewicz, Dan A. Simovici

January 15, 2002

Abstract

The paper presents a new general measure of rule interestingness. Many known measures such as $\chi^2$, Gini gain, or entropy gain can be obtained from this measure by setting some numerical parameters representing the amount of trust we have in the estimates of certain probabilities from the data. Moreover, we show that there is a continuum of measures having $\chi^2$, Gini gain, and entropy gain as boundary cases. Properties and experimental evaluation of the new measure are also presented.

Keywords: interestingness measure, distribution, Csiszár divergence, Kullback-Leibler divergence, rule.

1 Introduction

Determining the interestingness of rules is an important data mining problem. Many data mining algorithms produce enormous numbers of rules, making it impossible for the user to analyze all of them by hand. It is thus essential to establish some measure by which rule interestingness can be expressed numerically and used, for example, to sort the discovered rules. Many such measures have been proposed and used in the literature (see [1] for a survey).

In this paper we concentrate on measures that assess how much knowledge we gain on the joint distribution of a set of attributes $Q$ from knowing the joint distribution of some set of attributes $P$. Examples of such measures are entropy gain, mutual information, Gini gain, and $\chi^2$ [7, 9, 3, 1, 11, 10]. The rules considered here are thus different from the association rules studied in data mining, since we consider full joint distributions of both antecedent and consequent, while association rules consider only the probability of all attributes having some specified value. This approach has the advantage of natural applicability to multivalued attributes.

In this paper we demonstrate that all the above-mentioned measures are special cases of a more general parametric measure of interestingness, and that by choosing two numerical parameters a continuum of measures can be obtained containing several well-known interestingness measures as special cases.

Next, we give some essential definitions.

Definition 1 A probability distribution is a matrix of the form

$$\Delta = \begin{pmatrix} x_1 & \cdots & x_m \\ p_1 & \cdots & p_m \end{pmatrix},$$

where $p_i \geq 0$ for $1 \leq i \leq m$ and $\sum_{i=1}^{m} p_i = 1$.
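$\Delta$ is a uniform distribution if $p_1 = \cdots = p_m = \frac{1}{m}$. An $m$-valued uniform distribution will be denoted by $U_m$.

Let $\tau = (T, H, \rho)$ be a database table, where $T$ is the name of the table, $H$ is its heading, and $\rho$ is its content. If $A \in H$ is an attribute of $\tau$, the domain of $A$ in $\tau$ is denoted by $\mathrm{dom}(A)$. The projection of a tuple $t \in \rho$ on a set of attributes $L \subseteq H$ is denoted by $t[L]$. For more on relational notation and terminology see [13].

Definition 2 The distribution of a set of attributes $L = \{A_1, \ldots, A_n\}$ is the matrix

$$\Delta_{L,\tau} = \begin{pmatrix} \ell_1 & \cdots & \ell_r \\ p_1 & \cdots & p_r \end{pmatrix}, \qquad (1)$$

where $r = \prod_{j=1}^{n} |\mathrm{dom}(A_j)|$, $\ell_i \in \mathrm{dom}(A_1) \times \cdots \times \mathrm{dom}(A_n)$, and $p_i = \frac{|\{t \in \rho \mid t[L] = \ell_i\}|}{|\rho|}$ for $1 \leq i \leq r$. The subscript $\tau$ will be omitted when the table $\tau$ is clear from context.

Suppose that the distribution of the attribute set $L$ in the table $\tau = (T, H, \rho)$ is

$$\Delta_L = \begin{pmatrix} \ell_1 & \cdots & \ell_r \\ p_1 & \cdots & p_r \end{pmatrix}.$$

The Havrda-Charvát $\alpha$-entropy of the attribute set $L$ (see [6]) is defined as:

$$H_\alpha(L) = \frac{1}{1-\alpha} \left( \sum_{j=1}^{r} p_j^{\alpha} - 1 \right).$$

The limit case, when $\alpha$ tends towards 1, yields the Shannon entropy:

$$H(L) = -\sum_{j=1}^{r} p_j \log p_j.$$

Another important case is obtained when $\alpha = 2$. In this case, we obtain the Gini index of $L$ (see [1]), given by:

$$\mathrm{gini}(L) = 1 - \sum_{j=1}^{r} p_j^2.$$

If $L, K$ are two sets of attributes of a table $\tau$ that have the distributions

$$\Delta_L = \begin{pmatrix} \ell_1 & \cdots & \ell_m \\ p_1 & \cdots & p_m \end{pmatrix} \quad \text{and} \quad \Delta_K = \begin{pmatrix} k_1 & \cdots & k_n \\ q_1 & \cdots & q_n \end{pmatrix},$$

then the conditional Shannon entropy of $L$ conditioned upon $K$ is given by

$$H(L \mid K) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{q_j},$$

where $p_{ij} = \frac{|\{t \in \rho \mid t[L] = \ell_i \text{ and } t[K] = k_j\}|}{|\rho|}$ for $1 \leq i \leq m$ and $1 \leq j \leq n$. Similarly, the Gini conditional index of these distributions is:

$$\mathrm{gini}(L \mid K) = 1 - \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{p_{ij}^2}{q_j}.$$

These definitions allow us to introduce the Gini gain and the Shannon gain (called entropy gain in the literature [7]), defined as:

$$\mathrm{gain}_{\mathrm{gini}}(L, K) = \mathrm{gini}(L) - \mathrm{gini}(L \mid K),$$
$$\mathrm{gain}_{\mathrm{shannon}}(L, K) = H(L) - H(L \mid K) = H(L) + H(K) - H(L \cup K), \qquad (2)$$

respectively. Notice that the Shannon gain is identical to the mutual information between the attribute sets $L$ and $K$ [7]. For the Gini gain we can write:

$$\mathrm{gain}_{\mathrm{gini}}(L, K) = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{p_{ij}^2}{q_j} - \sum_{i=1}^{m} p_i^2. \qquad (3)$$

The product of the distributions $\Delta_P, \Delta_Q$, where

$$\Delta_P = \begin{pmatrix} x_1 & \cdots & x_m \\ p_1 & \cdots & p_m \end{pmatrix} \quad \text{and} \quad \Delta_Q = \begin{pmatrix} y_1 & \cdots & y_n \\ q_1 & \cdots & q_n \end{pmatrix},$$

is the distribution

$$\Delta_P \times \Delta_Q = \begin{pmatrix} (x_1, y_1) & \cdots & (x_m, y_n) \\ p_1 q_1 & \cdots & p_m q_n \end{pmatrix}.$$

The attribute sets $P, Q$ are independent if $\Delta_{PQ} = \Delta_P \times \Delta_Q$, where $PQ$ is an abbreviation for $P \cup Q$.

Definition 3 A rule is a pair of attribute sets $(P, Q)$. If $P, Q \subseteq H$, where $\tau = (T, H, \rho)$ is a table, then we refer to $(P, Q)$ as a rule of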
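$\tau$. If $(P, Q)$ is a rule, then we refer to $P$ as the antecedent and to $Q$ as the consequent of the rule. A rule $(P, Q)$ will be denoted, following the prevalent convention in the literature, by $P \to Q$.

This broader definition of rules originates in [3], where rules were replaced by dependencies in order to capture statistical dependence in both the presence and absence of items in itemsets. The significance of this dependence was measured by the $\chi^2$ test, and our approach is a further extension of that point of view.

The notion of distribution divergence is central to the rest of the paper.

Definition 4 Let $\mathcal{D}$ be the class of distributions. A distribution divergence is a function $D : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ such that:

1. $D(\Delta, \Delta') \geq 0$, and $D(\Delta, \Delta') = 0$ if and only if $\Delta = \Delta'$, for every $\Delta, \Delta' \in \mathcal{D}$.
2. When $\Delta'$ is fixed, $D(\Delta, \Delta')$ is a convex function of $\Delta$; in other words, if $\Delta = a_1 \Delta_1 + \cdots + a_k \Delta_k$, where $a_1 + \cdots + a_k = 1$, then

$$D(\Delta, \Delta') \leq \sum_{i=1}^{k} a_i D(\Delta_i, \Delta').$$

An important class of distribution divergences was obtained by Csiszár in [4] as:

$$D_\phi(\Delta, \Delta') = \sum_{i=1}^{n} q_i \, \phi\!\left( \frac{p_i}{q_i} \right),$$

where

$$\Delta = \begin{pmatrix} k_1 & \cdots & k_n \\ p_1 & \cdots & p_n \end{pmatrix} \quad \text{and} \quad \Delta' = \begin{pmatrix} l_1 & \cdots & l_n \\ q_1 & \cdots & q_n \end{pmatrix}$$

are two distributions and $\phi : \mathbb{R} \to \mathbb{R}$ is a twice differentiable convex function such that $\phi(1) = 0$. We will also make the additional assumption that $0 \cdot \phi\!\left(\frac{0}{0}\right) = 0$ to handle the case when for some $i$ both $p_i$ and $q_i$ are zero. If for some $i$, $p_i > 0$ and $q_i = 0$, the value of $D_\phi(\Delta, \Delta')$ is undefined.

The Csiszár divergence satisfies properties (1) and (2) given above (see [6]). The following result shows the invariance of the Csiszár divergence with respect to distribution product: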
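Before moving on, the definitions introduced so far can be checked numerically. The following is a minimal Python sketch, not code from the paper: it assumes a joint distribution is given as a nested list `joint[i][j]` $= p_{ij}$, uses natural logarithms, and all function names (`havrda_charvat_entropy`, `gini_gain`, `shannon_gain`, `csiszar_divergence`, `x_log_x`) are hypothetical names chosen for illustration. The choice $\phi(x) = x \log x$ is the standard one that turns the Csiszár divergence into the Kullback-Leibler divergence.

```python
import math


def havrda_charvat_entropy(p, alpha):
    """H_alpha = (sum_j p_j^alpha - 1) / (1 - alpha); Shannon entropy when alpha == 1."""
    if abs(alpha - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (sum(x ** alpha for x in p) - 1.0) / (1.0 - alpha)


def marginals(joint):
    """Return (p, q): row marginals p_i for L and column marginals q_j for K."""
    p = [sum(row) for row in joint]
    q = [sum(col) for col in zip(*joint)]
    return p, q


def gini_gain(joint):
    """Equation (3): sum_ij p_ij^2 / q_j - sum_i p_i^2."""
    p, q = marginals(joint)
    conditional = sum(
        pij ** 2 / q[j]
        for row in joint
        for j, pij in enumerate(row)
        if q[j] > 0
    )
    return conditional - sum(x ** 2 for x in p)


def shannon_gain(joint):
    """Equation (2): H(L) + H(K) - H(L u K)."""
    p, q = marginals(joint)
    flat = [pij for row in joint for pij in row]
    return (
        havrda_charvat_entropy(p, 1.0)
        + havrda_charvat_entropy(q, 1.0)
        - havrda_charvat_entropy(flat, 1.0)
    )


def csiszar_divergence(p, q, phi):
    """D_phi = sum_i q_i * phi(p_i / q_i), with 0 * phi(0/0) taken as 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if qi == 0:
            if pi > 0:
                raise ValueError("undefined: p_i > 0 while q_i = 0")
            continue  # both zero: contributes 0 by the stated convention
        total += qi * phi(pi / qi)
    return total


def x_log_x(x):
    """phi(x) = x log x, extended by continuity with phi(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


# Illustrative (made-up) joint distribution of two binary attribute sets.
joint = [[0.3, 0.1], [0.1, 0.5]]
p, q = marginals(joint)
flat = [pij for row in joint for pij in row]
product = [pi * qj for pi in p for qj in q]  # Delta_L x Delta_K, same ordering as flat

# Divergence of the joint distribution from the product of its marginals.
kl = csiszar_divergence(flat, product, x_log_x)
```

With these definitions, `havrda_charvat_entropy(p, 2.0)` agrees with the Gini index $1 - \sum_j p_j^2$, and `shannon_gain(joint)` agrees with `kl`, which is the usual identity between mutual information and the Kullback-Leibler divergence of a joint distribution from the product of its marginals. Both gains vanish when `joint` is itself a product distribution.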