您好,欢迎访问三七文档
当前位置:首页 > 金融/证券 > 股票报告 > 第三章关联规则挖掘4new
数据挖掘第三章:关联规则挖掘本章内容3.1关联规则挖掘的基本概念3.2由事务数据库挖掘单维布尔关联规则3.3多层关联规则挖掘3.4多维关联规则挖掘3.5关联规则、相关性、因果关系的区别基本要求:掌握关联规则、多层关联规则、多维关联规则的基本概念,掌握各种关联规则挖掘的算法3.1基本概念关联规则挖掘:从事务数据库中发现项集之间有趣的关联.应用:购物篮分析(Basketdataanalysis)、交叉营销(cross-marketing)、产品目录设计(catalogdesign)等例子:规则形式:“BodyHead[support,confidence]”.buys(x,“diapers”)buys(x,“beers”)[0.5%,60%]major(x,“CS”)^takes(x,“DB”)grade(x,“A”)[1%,75%]3.1基本概念布尔型与数值型关联规则(基于要处理的数据类型)buys(x,“SQLServer”)^buys(x,“DMBook”)buys(x,“DBMiner”)[0.2%,60%]age(x,“30..39”)^income(x,“42..48K”)buys(x,“PC”)[1%,75%]单维与多维关联规则单层与多层关联规则Whatbrandsofbeersareassociatedwithwhatbrandsofdiapers?3.1基本概念规则度量:支持度和置信度TIDItem2000A,B,C1000A,C4000A,D5000B,E,FCustomerbuysdiaperCustomerbuysbothCustomerbuysbeer支持度s是指事务集D中包含AB的百分比Support(AB)=P(AB)置信度c是指D中包含A的事务同时也包含B的百分比Confidence(AB)=P(AB)=P(AB)/P(A)AC?CA?3.1基本概念规则度量:支持度和置信度设最小支持度阈值为50%,最小置信度阈值为50%,则有如下规则AC(50%,66.6%)CA(50%,100%)同时满足最小支持度阈值和最小置信度阈值的规则称作强规则3.1基本概念关联规则的形式化表示LetI={i1,i2,…,im}beasetofitems.LetDbeasetoftransactions,whereeachtransactionTisasetofitemssuchthatTI.AtransactionTcontainsX,asetofsomeitemsinI,ifXT.AnassociationruleisanimplicationoftheformXY,whereXI,YI,andXY=Ø.XYholdsinthetransactionsetDwithconfidencecifc%oftransactionsinDthatcontainXalsocontainY.XYhassupportsinthetransactionsetDifs%oftransactionsinDcontainXY.3.1基本概念事物数据:商场购物数据Marketbaskettransactions:t1:{bread,cheese,milk}t2:{apple,eggs,salt,yogurt}……tn:{biscuit,eggs,milk}Concepts:Anitem:anitem/articleinabasketI:thesetofallitemssoldinthestoreAtransaction:itemspurchasedinabasket;itmayhaveTID(transactionID)Atransactionaldataset:Asetoftransactions3.1基本概念事物数据:文档数据集每个文档都可以看作一个词袋(bagofwords)doc1:Student,Teach,Schooldoc2:Student,Schooldoc3:Teach,School,City,Gamedoc4:Baseball,Basketballdoc5:Basketball,Player,Spectatordoc6:Baseball,Coach,Game,Teamdoc7:Basketball,Team,City,Game本章内容3.1关联规则挖掘的基本概念3.2由事务数据库挖掘单维布尔关联规则3.3多层关联规则挖掘3.4多维关联规则挖掘3.5关联规则、相关性、因果关系的区别3.2由事务数据库挖掘布尔关联规则关联规则挖掘的原始算法:枚举每个可能的规则,计算其支持度与置信度。包含d个项目的数据集中可能的规则数目为x1231111dddkkdjjkdkdRIfd=6,R=602rules3.2由事务数据库挖掘布尔关联规则频繁项集性质向下封闭性质(Downwardclosure):频繁项集的任意子集都是频繁的•如果{beer,diaper,nuts}是频繁的,则{beer,diaper}也是频繁的•任何含有{beer,diaper,nuts}的事物数据也包含{beer,diaper}Apriori修剪原理:非频繁项集的任意超集都是非频繁的,无需生成与测试3.2由事务数据库挖掘布尔关联规则FoundtobeInfrequentnullABACADAEBCBDBECDCEDEABCDEABCABDABEACDACEADEBCDBCEBDECDEABCDABCEABDEACDEBCDEABCDEnullABACADAEBCBDBECDCEDEABCDEABCABDABEACDACEADEBCDBCEBDECDEABCDABCEABDEACDEBCDEABCDEPrunedsupersetsClosedPatternsandMax-PatternsAlongpatterncontainsacombinatorialnumberofsub-patterns,e.g.,{a1,…,a100}contains(1001)+(1002)+…+(110000)=2100–1=1.27*1030sub-patterns!Solution:Mineclosedpatternsandmax-patternsinsteadAnitemsetXisclosedifXisfrequentandthereexistsnosuper-patternYכX,withthesamesupportasXAnitemsetXisamax-patternifXisfrequentandthereexistsnofrequentsuper-patternYכXClosedpatternisalosslesscompressionoffreq.patterns•Reducingthe#ofpatternsandrules3.2由事务数据库挖掘布尔关联规则ClosedPatternsandMax-Patternsamaximalitemsetisacloseditemsetbutacloseditemsetisnotnecessarilyamaximalitemset.3.2由事务数据库挖掘布尔关联规则ClosedPatternsandMax-PatternsExercise.DB={a1,…,a100,a1,…,a50}•Min_sup=1.Whatisthesetofcloseditemset?•a1,…,a100:?•a1,…,a50:?Whatisthesetofmax-pattern?•a1,…,a100:?Whatisthesetofallpatterns?•!!3.2由事务数据库挖掘布尔关联规则3.2由事务数据库挖掘布尔关联规则Apriori算法Ck:候选k-项集Lk:频繁k-项集Join阶段:Ck由Lk-1链接而成Prune阶段:任何包含非频繁的(k-1)-项集都不可能是频繁的k-项集3.2由事务数据库挖掘布尔关联规则Apriori算法1)L1={large1-itemsets};2)for(k=2;Lk-1Ø;k++)dobegin3)Ck=apriori-gen(Lk-1);//Newcandidates4)foralltransactionstDdobegin5)Ct=subset(Ck,t);//Candidatescontainedint6)forallcandidatescCtdo7)c.count++;8)end9)Lk={cCk|c.countminsup}10)end11)Answer=kLk;3.2由事务数据库挖掘布尔关联规则Apriori候选集生成apriori-gen函数以所有的(k–1)项集的集合Lk–1为输入,返回所有k-项集的集合Ck.首先,在join阶段:insertintoCkselectp.item1,p.item2,…,p.itemk–1,q.itemk–1fromLk–1p,Lk–1qwherep.item1=q.item1,…,p.itemk–2=q.itemk–2,p.itemk–1q.itemk–1其次,在prune阶段,删除所有的满足(k–1)子集不在Lk–1的项集cCk:forallitemsetscCkdoforall(k–1)-subsetssofcdoif(sLk–1)thendeletecfromCk;3.2由事务数据库挖掘布尔关联规则L3={abc,abd,acd,ace,bcd}Self-joining:L3*L3abcdfromabcandabdacdefromacdandacePruning:acdeisremovedbecauseadeisnotinL3C4={abcd}3.2由事务数据库挖掘布尔关联规则Apriori算法举例TIDItems100134200235300123540025DatabaseDitemsetsup.{1}2{2}3{3}3{4}1{5}3itemsetsup.{1}2{2}3{3}3{5}3ScanDC1L1itemset{12}{13}{15}{23}{25}{35}itemsetsup{12}1{13}2{15}1{23}2{25}3{35}2itemsetsup{13}2{23}2{25}3{35}2L2C2C2ScanDC3L3itemset{235}ScanDitemsetsup{235}2由频繁项集产生关联规则同时满足最小支持度和最小置信度的才是强关联规则,从频繁项集产生的规则都满足支持度要求,而其置信度则可由以下公式计算:每个关联规则可由如下过程产生:•对于每个频繁项集l,产生l的所有非空子集;•对于每个非空子集s,如果则输出规则“”3.2由事务数据库挖掘布尔关联规则)(_sup)(_sup)|()(AcountportBAcountportBAPBAconfidenceconfscountportlcountportmin_)(_sup)(_sup)(sls如果{A,B,C,D}频繁项集,候选规则为:ABCD,ABDC,ACDB,BCDA,ABCD,BACD,CABD,DABCABCD,ACBD,ADBC,BCAD,BDAC,CDAB如果|L|=k,那么有2k–2候选规则(忽略LandL)候选规则天生满足最小支持度,关键是需要判断是否满足最小置信度3.2由事务数据库挖掘布尔关联规则3.2由事务数据库挖掘布尔关联规则如何有效地从频繁项集生成关联规则一般来说,置信度不具有反单调性c(ABCD)canbelargerorsmallerthanc(ABD)有相同项集生成的规则的置信
本文标题:第三章关联规则挖掘4new
链接地址:https://www.777doc.com/doc-2121041 .html