Parallel Bifold Large-Scale Parallel Pattern Minin

Parallel Bifold: Large-Scale Parallel Pattern Mining with Constraints

Mohammad El-Hajj, Osmar R. Zaïane
Department of Computing Science, UofA, Edmonton, AB, Canada
{mohammad, zaiane}@cs.ualberta.ca
University Of Alberta

Distributed and Parallel Databases (Springer) 2006 20:225-243

Abstract. When computationally feasible, mining huge databases produces tremendously large numbers of frequent patterns. In many cases, it is impractical to mine those datasets due to their sheer size: not only the extent of the existing patterns, but mainly the magnitude of the search space. Many approaches have suggested either applying constraints to the patterns or searching for frequent patterns in parallel. So far, those approaches are still not genuinely effective for mining extremely large datasets. We propose a method that combines both strategies efficiently, i.e. mining in parallel for the set of patterns while pushing constraints. Using this approach we could mine significantly large datasets, with sizes never reported in the literature before. We are able to effectively discover frequent patterns in a database of a billion transactions using a 32-processor cluster in less than 2 hours.

1 Introduction

Frequent Itemset Mining (FIM) is a key component of many algorithms which extract patterns from transactional databases. For example, FIM can be leveraged to produce association rules, clusters, classifiers or contrast sets. This capability provides a strategic resource for decision support, and is most commonly used for market basket analysis. One challenge for frequent itemset mining is the potentially huge number of extracted patterns, which can eclipse the original database in size. In addition to increasing the cost of mining, this makes it more difficult for users to find the valuable patterns. Introducing constraints to the mining process helps mitigate both issues. Decision makers can restrict discovered patterns according to specified rules. By applying these restrictions as early as possible, the cost of mining can be constrained. For example, users may be interested in purchases whose total price exceeds $100, or whose items cost between $50 and $100.

While discovering hidden knowledge in the available repositories of data is an important goal for decision makers, discovering this knowledge in a “reasonable” time is capital. Despite the increase in data collection, the rapidity of the pattern discovery remains vital and will always be essential. Speeding up the process of knowledge discovery has become a critical problem, and parallelism is shown to be a potential solution for such a scalability predicament. Naturally, parallelization is not the only solution, and should not be the first one, for speeding up the data mining process. Indeed, other approaches might help in achieving this goal, such as sampling, attribute selection, restriction of search space, and algorithm or code optimization [15]. Some of these approaches might be used in conjunction with parallelism to achieve the desired speedup.

A legitimate issue is whether parallelism is needed in data mining. Efficiency is crucial in knowledge discovery systems, and with the explosive growth of data collection, sequential data mining algorithms have become an unacceptable solution to most real-size problems even after clever optimizations. To illustrate the complexity of the problem of frequent itemset enumeration in today's real data, assume a small token case with only 5 possible items (i.e. a store that sells only 5 distinct products): the lattice that represents all possible candidate frequent patterns has \(2^{5} - 1 = 31\) itemsets. Applications that generate transactions with sizes greater than 100 items per transaction are common. In those cases, finding a frequent itemset of size 100 would involve a search space of \(2^{100} - 1 \approx 1.27 \times 10^{30}\) itemsets. Add the fact that most real transactional databases are in the order of millions, if not billions, of transactions, and the problem becomes intractable with current sequential solutions. With hundreds of gigabytes, and often terabytes, of data and thousands of distinct items, it is unrealistic for one processor to mine the data sequentially, especially when multiple passes over these enormous databases are required.

There are different design issues that affect building parallel frequent mining algorithms [35, 34]. These design issues are significantly affected by the specification of the problem that the system is trying to solve. Constraint-based mining is an ongoing area of research, and two important categories of constraints, monotone and anti-monotone [20], are studied in this work. Anti-monotone constraints are constraints that, when valid for a pattern, are consequently valid for any subset subsumed by the pattern. Monotone constraints, when valid for a pattern, are inevitably valid for any superset subsuming that pattern. The straightforward way to deal with constraints is to use them as a filter post-mining. However, it is more efficient to consider the constraints during the mining process. This is what is referred to as “pushing the constraints” [24]. Most existing algorithms leverage (or push) one of these types during mining and postpone the other to a post-processing phase.

1.1 Problem Statement

The problem of mining association rules over market basket analysis was introduced in [1, 2]. The problem consists of finding associations between items or itemsets in transactional data. The data is typically retail sales in the form of customer transactions, but can be any data that can be modeled as transactions. Examples include medical images, where each image is modeled by a transaction of visual features from the image [33]; text data, where each document is modeled by a transaction representing a bag of words [14]; or web access data, where click-stream visitation is modeled by sets of transactions [1
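
The monotone and anti-monotone definitions given in the Introduction, together with the 5-item lattice count, can be checked by brute force on a toy example. The following is a minimal Python sketch and not the paper's algorithm: the item names, the prices, and every helper function (`total_price`, `at_most`, `at_least`, `all_itemsets`, `is_anti_monotone`, `is_monotone`) are hypothetical and introduced only for illustration. It assumes all prices are positive, which is what makes an upper bound on total price anti-monotone and a lower bound monotone.

```python
from itertools import combinations

# Hypothetical prices for a store with 5 distinct products (illustration only).
PRICES = {"a": 60, "b": 30, "c": 20, "d": 10, "e": 5}


def total_price(itemset):
    """Sum of the (hypothetical) prices of the items in an itemset."""
    return sum(PRICES[item] for item in itemset)


def at_most(limit):
    """Constraint 'total price <= limit' (anti-monotone for positive prices)."""
    return lambda itemset: total_price(itemset) <= limit


def at_least(limit):
    """Constraint 'total price >= limit' (monotone for positive prices)."""
    return lambda itemset: total_price(itemset) >= limit


def all_itemsets(items):
    """Enumerate every non-empty itemset: 2**n - 1 of them for n items."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_anti_monotone(constraint, items):
    """Brute force: every non-empty subset of a valid itemset must be valid."""
    for itemset in all_itemsets(items):
        if constraint(itemset):
            if not all(constraint(sub) for sub in all_itemsets(itemset)):
                return False
    return True


def is_monotone(constraint, items):
    """Brute force: every superset of a valid itemset must be valid."""
    universe = list(all_itemsets(items))
    for itemset in universe:
        if constraint(itemset):
            for other in universe:
                if itemset <= other and not constraint(other):
                    return False
    return True


if __name__ == "__main__":
    print(len(list(all_itemsets(PRICES))))          # size of the 5-item lattice
    print(is_anti_monotone(at_most(50), PRICES))    # upper bound on total price
    print(is_monotone(at_least(100), PRICES))       # lower bound on total price
```

The brute-force checks walk the whole lattice, so they are only usable on token cases like this one. Their purpose is to make the two definitions concrete: a violated anti-monotone constraint allows every superset of a pattern to be discarded, and a satisfied monotone constraint never needs re-checking on supersets.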
