您好,欢迎访问三七文档
ParallelBifold:Large-ScaleParallelPatternMiningwithConstraintsMohammadEl-Hajj,OsmarR.Za¨ıaneDepartmentofComputingScience,UofA,Edmonton,AB,Canada{mohammad,zaiane}@cs.ualberta.caUniversityOfAlbertaAbstract.Whencomputationallyfeasible,mininghugedatabasespro-ducestremendouslylargenumbersoffrequentpatterns.Inmanycases,itisimpracticaltominethosedatasetsduetotheirsheersize;notonlytheextentoftheexistingpatterns,butmainlythemagnitudeofthesearchspace.Manyapproacheshavesuggestedtheuseofconstraintstoapplytothepatternsorsearchingforfrequentpatternsinparallel.Sofar,thoseapproachesarestillnotgenuinelyeffectivetomineextremelylargedatasets.Weproposeamethodthatcombinesbothstrategiesefficiently,i.e.min-inginparallelforthesetofpatternswhilepushingconstraints.Usingthisapproachwecouldminesignificantlylargedatasets;withsizesneverreportedintheliteraturebefore.Weareabletoeffectivelydiscoverfre-quentpatternsinadatabasemadeofbilliontransactionsusinga32processorsclusterinlessthan2hours.1IntroductionFrequentItemsetMining(FIM)isakeycomponentofmanyalgorithmswhichex-tractpatternsfromtransactionaldatabases.Forexample,FIMcanbeleveragedtoproduceassociationrules,clusters,classifiersorcontrastsets.Thiscapabilityprovidesastrategicresourcefordecisionsupport,andismostcommonlyusedformarketbasketanalysis.Onechallengeforfrequentitemsetminingisthepotentiallyhugenumberofextractedpatterns,whichcaneclipsetheoriginaldatabaseinsize.Inadditiontoincreasingthecostofmining,thismakesitmoredifficultforuserstofindthevaluablepatterns.Introducingconstraintstotheminingprocesshelpsmitigatebothissues.Decisionmakerscanrestrictdiscoveredpatternsaccordingtospec-ifiedrules.Byapplyingtheserestrictionsasearlyaspossible,thecostofminingcanbeconstrained.Forexample,usersmaybeinterestedinpurchaseswhosetotalpriceexceeds$100,orwhoseitemscostbetween$50and$100.Whilediscoveringhiddenknowledgeintheavailablerepositoriesofdataisanimportantgoalfordecisionmakers,discoveringthisknowledgeina“reasonable”timeiscapital.Despitetheincreaseindatacollection,therapidityofthepatterndiscoveryremainsvitalandwillalwaysbeessential.Speedinguptheprocessofknowledgediscoveryhasbecomeacriticalproblem,andparallelismisshowntoDistributedandParalleldatabases(Springer)200620:225-2432beapotentialsolutionforsuchascalabilitypredicament.Naturally,paralleliza-tionisnottheonlyandshouldnotbethefirstsolutiontospeedupthedataminingprocess.Indeed,otherapproachesmighthelpinachievingthisgoal,suchassampling,attributeselection,restrictionofsearchspace,andalgorithmorcodeoptimization[15].Someoftheseapproachesmightbeusedinconjunctionwithparallelismtoachievethedesiredspeedup.Alegitimateissueiswhetherparallelismisneededindatamining.Efficiencyiscrucialinknowledgediscoverysystems,andwiththeexplosivegrowthofdatacollection,sequentialdatamin-ingalgorithmshavebecomeanunacceptablesolutiontomostrealsizeproblemsevenaftercleveroptimizations.Toillustratethecomplexityoftheproblemoffrequentitemsetenumerationintoday’srealdata,assumeasmalltokencasewithonly5possibleitems(i.e.astorethatsellsonly5distinctproducts),thelatticethatrepresentsallpossiblecandidatefrequentpatternshas25−1=31itemsets.Applicationsthatgeneratetransactionswithsizesgreaterthan100itemspertransactionarecommon.Inthosecases,tofindafrequentitemsetwithsize100,itwouldtakeasearchspaceof2100−1=1.27∗1030itemsets.Addingthefactthatmostrealtransactionaldatabasesareintheorderofmil-lions,ifnotbillions,oftransactionsandtheproblembecomesintractablewithcurrentsequentialsolutions.Withhundredsofgigabytes,andoftenterabytesandthousandsofdistinctitems,itisunrealisticforoneprocessortominethedatasequentially,especiallywhenmultiplepassesovertheseenormousdatabasesarerequired.Therearedifferentdesignissuesthataffectbuildingparallelfrequentminingalgorithms[35,34].Thesedesignissuesaresignificantlyaffectedbythespecifi-cationoftheproblemthatthesystemistryingtosolve.Constraintbasedminingisanongoingareaofresearchwheretwoimportantcategoriesofconstraintsmonotoneandanti-monotone[20]arestudiedinthiswork.Anti-monotoneconstraintsareconstraintsthatwhenvalidforapattern,theyareconsequentiallyvalidforanysubsetsubsumedbythepattern.Mono-toneconstraintswhenvalidforapatternareinevitablyvalidforanysupersetsubsumingthatpattern.Thestraightforwardwaytodealwithconstraintsistousethemasafilterpost-mining.Howeveritismoreefficienttoconsidertheconstraintsduringtheminingprocess.Thisiswhatisrefereedtoas“pushingtheconstraints”[24].Mostexistingalgorithmsleverage(orpush)oneofthesetypesduringminingandpostponetheothertoapost-processingphase.1.1ProblemStatementTheproblemofminingassociationrulesovermarketbasketanalysiswasintro-ducedin[1,2].Theproblemconsistsoffindingassociationsbetweenitemsoritemsetsintransactionaldata.Thedataistypicallyretailsalesintheformofcustomertransactions,butcanbeanydatathatcanbemodeledintotransac-tions.Forexamplemedicalimageswhereeachimageismodeledbyatransactionofvisualfeaturesfromtheimage[33],ortextdatawhereeachdocumentismod-eledbyatransactionrepresentingabagofwords[14],orwebaccessdatawhereclick-streamvisitationismodeledbysetsoftransactions[1
本文标题:Parallel Bifold Large-Scale Parallel Pattern Minin
链接地址:https://www.777doc.com/doc-3879237 .html