您好,欢迎访问三七文档
DataMiningAssociationAnalysis:BasicConceptsandAlgorithmsLectureNotesforChapter6IntroductiontoDataMiningbyTan,Steinbach,Kumar©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20041©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20042AssociationRuleMiningzGivenasetoftransactions,findrulesthatwillpredicttheoccurrenceofanitembasedontheoccurrencesofotheritemsinthetransactionMarket-BaskettransactionsTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeExampleofAssociationRules{Diaper}→{Beer},{Milk,Bread}→{Eggs,Coke},{Beer,Bread}→{Milk},Implicationmeansco-occurrence,notcausality!©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20043Definition:FrequentItemsetzItemset–AcollectionofoneormoreitemsExample:{Milk,Bread,Diaper}–k-itemsetAnitemsetthatcontainskitemszSupportcount(σ)–Frequencyofoccurrenceofanitemset–E.g.σ({Milk,Bread,Diaper})=2zSupport–Fractionoftransactionsthatcontainanitemset–E.g.s({Milk,Bread,Diaper})=2/5zFrequentItemset–AnitemsetwhosesupportisgreaterthanorequaltoaminsupthresholdTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,Coke©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20044Definition:AssociationRuleExample:Beer}Diaper,Milk{⇒4.052|T|)BeerDiaper,,Milk(===σs67.032)Diaper,Milk()BeerDiaper,Milk,(===σσczAssociationRule–AnimplicationexpressionoftheformX→Y,whereXandYareitemsets–Example:{Milk,Diaper}→{Beer}zRuleEvaluationMetrics–Support(s)FractionoftransactionsthatcontainbothXandY–Confidence(c)MeasureshowoftenitemsinYappearintransactionsthatcontainXTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,Coke©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20045AssociationRuleMiningTaskzGivenasetoftransactionsT,thegoalofassociationruleminingistofindallruleshaving–support≥minsupthreshold–confidence≥minconfthresholdzBrute-forceapproach:–Listallpossibleassociationrules–Computethesupportandconfidenceforeachrule–Prunerulesthatfailtheminsupandminconfthresholds⇒Computationallyprohibitive!©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20046MiningAssociationRulesExampleofRules:{Milk,Diaper}→{Beer}(s=0.4,c=0.67){Milk,Beer}→{Diaper}(s=0.4,c=1.0){Diaper,Beer}→{Milk}(s=0.4,c=0.67){Beer}→{Milk,Diaper}(s=0.4,c=0.67){Diaper}→{Milk,Beer}(s=0.4,c=0.5){Milk}→{Diaper,Beer}(s=0.4,c=0.5)TIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeObservations:•Alltheaboverulesarebinarypartitionsofthesameitemset:{Milk,Diaper,Beer}•Rulesoriginatingfromthesameitemsethaveidenticalsupportbutcanhavedifferentconfidence•Thus,wemaydecouplethesupportandconfidencerequirements©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20047MiningAssociationRuleszTwo-stepapproach:1.FrequentItemsetGeneration–Generateallitemsetswhosesupport≥minsup2.RuleGeneration–Generatehighconfidencerulesfromeachfrequentitemset,whereeachruleisabinarypartitioningofafrequentitemsetzFrequentitemsetgenerationisstillcomputationallyexpensive©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20048FrequentItemsetGenerationnullABACADAEBCBDBECDCEDEABCDEABCABDABEACDACEADEBCDBCEBDECDEABCDABCEABDEACDEBCDEABCDEGivenditems,thereare2dpossiblecandidateitemsets©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20049FrequentItemsetGenerationzBrute-forceapproach:–Eachitemsetinthelatticeisacandidatefrequentitemset–Countthesupportofeachcandidatebyscanningthedatabase–Matcheachtransactionagainsteverycandidate–Complexity~O(NMw)=ExpensivesinceM=2d!!!TIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeTransactions©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200410ComputationalComplexityzGivenduniqueitems:–Totalnumberofitemsets=2d–Totalnumberofpossibleassociationrules:1231111+−=⎥⎦⎤⎢⎣⎡⎟⎠⎞⎜⎝⎛−×⎟⎠⎞⎜⎝⎛=+−=−=∑∑dddkkdjjkdkdRIfd=6,R=602rules©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200411FrequentItemsetGenerationStrategieszReducethenumberofcandidates(M)–Completesearch:M=2d–UsepruningtechniquestoreduceMzReducethenumberoftransactions(N)–ReducesizeofNasthesizeofitemsetincreases–UsedbyDHPandvertical-basedminingalgorithmszReducethenumberofcomparisons(NM)–Useefficientdatastructurestostorethecandidatesortransactions–Noneedtomatcheverycandidateagainsteverytransaction©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200412ReducingNumberofCandidateszAprioriprinciple:–Ifanitemsetisfrequent,thenallofitssubsetsmustalsobefrequentzAprioriprincipleholdsduetothefollowingpropertyofthesupportmeasure:–Supportofanitemsetneverexceedsthesupportofitssubsets–Thisisknownastheanti-monotonepropertyofsupport)()()(:,YsXsYXYX≥⇒⊆∀©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200413FoundtobeInfrequentIllustratingAprioriPrinciplePrunedsupersets©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200414IllustratingAprioriPrincipleItemCountBread4Coke2Milk4Beer3Diaper4Eggs1ItemsetCount{Bread,Milk}3{Bread,Beer}2{Bread,Diaper}3{Milk,Beer}2{Milk,Diaper}3{Beer,Dia
本文标题:Introduction to data mining-lecture notes chap06_b
链接地址:https://www.777doc.com/doc-6131178 .html