ch04: Data Stream Mining 1

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University

In many data mining situations, we do not know the entire data set in advance.
Stream management is important when the input rate is controlled externally, for example Google queries or Twitter and Facebook status updates.
We can think of the data as infinite and non-stationary (the distribution changes over time).

Input elements enter at a rapid rate, at one or more input ports (i.e., streams).
We call the elements of the stream tuples.
The system cannot store the entire stream accessibly.
Q: How do you make critical calculations about the stream using a limited amount of (secondary) memory?

Stochastic Gradient Descent (SGD) is an example of a stream algorithm. In machine learning this is called online learning.
It allows for modeling problems where we have a continuous stream of data: we want an algorithm to learn from it and slowly adapt to changes in the data.
Idea: Do slow updates to the model. SGD (SVM, Perceptron) makes small updates.
So: First train the classifier on training data. Then, for every example from the stream, slightly update the model (using a small learning rate).

Types of queries one wants to answer on a data stream (we'll do these today):
Sampling data from a stream: construct a random sample.
Queries over sliding windows: number of items of type x in the last k elements of the stream.

Types of queries one wants to answer on a data stream (we'll do these next time):
Filtering a data stream: select elements with property x from the stream.
Counting distinct elements: number of distinct elements in the last k elements of the stream.
Estimating moments: estimate avg./std. dev. of the last k elements.
Finding frequent elements.

Mining query streams: Google wants to know what queries are more frequent today than yesterday.
Mining click streams: Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour.
Mining social network news feeds: e.g., look for trending topics on Twitter, Facebook.

Sensor networks: many sensors feeding into a central controller.
Telephone call records: data feeds into customer bills as well as settlements between telephone companies.
IP packets monitored at a switch: gather information for optimal routing; detect denial-of-service attacks.

Since we cannot store the entire stream, one obvious approach is to store a sample. Two different problems:
(1) Sample a fixed proportion of elements in the stream (say 1 in 10).
(2) Maintain a random sample of fixed size over a potentially infinite stream. At any "time" k we would like a random sample of s elements.
What is the property of the sample we want to maintain? For all time steps k, each of the k elements seen so far has equal probability of being sampled.

Problem 1: Sampling a fixed proportion
Scenario: search engine query stream. Stream of tuples: (user, query, time).
Answer questions such as: How often did a user run the same query in a single day?
We have space to store 1/10th of the query stream.
Naïve solution: Generate a random integer in [0..9] for each query. Store the query if the integer is 0, otherwise discard it.

Simple question: What fraction of queries by an average search engine user are duplicates?
Suppose each user issues x queries once and d queries twice (total of x + 2d queries).
Correct answer: d/(x + d).
Proposed solution: We keep 10% of the queries.
The sample will contain x/10 of the singleton queries and 2d/10 of the duplicate queries at least once.
But only d/100 pairs of duplicates: d/100 = 1/10 · 1/10 · d.
Of the d "duplicates", 18d/100 appear exactly once: 18d/100 = ((1/10 · 9/10) + (9/10 · 1/10)) · d.
So the sample-based answer is
$$\frac{d/100}{x/10 + d/100 + 18d/100} = \frac{d}{10x + 19d}$$

Pick 1/10th of users and take all their searches in the sample.
Use a hash function that hashes the user name or user id uniformly into 10 buc
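The contrast between sampling individual queries and sampling whole users can be checked with a small simulation. The sketch below is illustrative only: the synthetic stream, the helper names (keep_user, duplicate_fraction, make_stream), and the choice of MD5 as the hash are assumptions made for this example, not part of the original slides. It builds a stream in which every user issues x = 10 queries once and d = 10 queries twice, then estimates the duplicate fraction from a per-query 1-in-10 sample and from a per-user 1-in-10 sample.

```python
import hashlib
import random


def keep_user(user_id: str, num_buckets: int = 10, keep_buckets: int = 1) -> bool:
    """Hash the user id into num_buckets buckets; keep users in the first keep_buckets."""
    # MD5 is used only as a stable, roughly uniform hash (an assumption of this sketch).
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    return bucket < keep_buckets


def duplicate_fraction(queries):
    """Fraction of distinct (user, query) pairs that occur at least twice."""
    counts = {}
    for user, query in queries:
        counts[(user, query)] = counts.get((user, query), 0) + 1
    if not counts:
        return 0.0
    duplicates = sum(1 for c in counts.values() if c >= 2)
    return duplicates / len(counts)


def make_stream(num_users, x, d, rng):
    """Each user issues x queries once and d queries twice, in shuffled order."""
    stream = []
    for u in range(num_users):
        user = f"user{u}"
        for i in range(x):
            stream.append((user, f"single{i}"))
        for i in range(d):
            stream.append((user, f"dup{i}"))
            stream.append((user, f"dup{i}"))
    rng.shuffle(stream)
    return stream


rng = random.Random(0)
x, d = 10, 10
stream = make_stream(num_users=2000, x=x, d=d, rng=rng)

# Ground truth on the full stream: d / (x + d).
true_frac = duplicate_fraction(stream)

# Naive approach: keep each query independently with probability 1/10.
naive_sample = [t for t in stream if rng.randrange(10) == 0]
naive_frac = duplicate_fraction(naive_sample)

# User-based approach: keep every query of the users hashed to bucket 0.
user_sample = [t for t in stream if keep_user(t[0])]
user_frac = duplicate_fraction(user_sample)

print("true fraction:      ", true_frac)
print("naive estimate:     ", naive_frac, "(theory:", d / (10 * x + 19 * d), ")")
print("user-based estimate:", user_frac)
```

Because a sampled user contributes all of their queries, the user-based estimate matches d/(x + d) on this synthetic data, while the per-query estimate lands near d/(10x + 19d). Keeping a different share of users only requires changing keep_buckets relative to num_buckets.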
