您好,欢迎访问三七文档
当前位置:首页 > IT计算机/网络 > 数据挖掘与识别 > ch04-数据流挖掘1
MiningofMassiveDatasetsJureLeskovec,AnandRajaraman,JeffUllmanStanfordUniversity::MiningofMassiveDatasets,Inmanydataminingsituations,wedonotknowtheentiredatasetinadvanceStreamManagementisimportantwhentheinputrateiscontrolledexternally:GooglequeriesTwitterorFacebookstatusupdatesWecanthinkofthedataasinfiniteandnon-stationary(thedistributionchangesovertime)J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,Inputelementsenteratarapidrate,atoneormoreinputports(i.e.,streams)WecallelementsofthestreamtuplesThesystemcannotstoretheentirestreamaccessiblyQ:Howdoyoumakecriticalcalculationsaboutthestreamusingalimitedamountof(secondary)memory?J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,StochasticGradientDescent(SGD)isanexampleofastreamalgorithmInMachineLearningwecallthis:OnlineLearningAllowsformodelingproblemswherewehaveacontinuousstreamofdataWewantanalgorithmtolearnfromitandslowlyadapttothechangesindataIdea:DoslowupdatestothemodelSGD(SVM,Perceptron)makessmallupdatesSo:Firsttraintheclassifierontrainingdata.Then:Foreveryexamplefromthestream,weslightlyupdatethemodel(usingsmalllearningrate)J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,:MiningofMassiveDatasets,Typesofqueriesonewantsonansweronadatastream:(we’lldothesetoday)SamplingdatafromastreamConstructarandomsampleQueriesoverslidingwindowsNumberofitemsoftypexinthelastkelementsofthestreamJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,Typesofqueriesonewantsonansweronadatastream:(we’lldothesenexttime)FilteringadatastreamSelectelementswithpropertyxfromthestreamCountingdistinctelementsNumberofdistinctelementsinthelastkelementsofthestreamEstimatingmomentsEstimateavg./std.dev.oflastkelementsFindingfrequentelementsJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,MiningquerystreamsGooglewantstoknowwhatqueriesaremorefrequenttodaythanyesterdayMiningclickstreamsYahoowantstoknowwhichofitspagesaregettinganunusualnumberofhitsinthepasthourMiningsocialnetworknewsfeedsE.g.,lookfortrendingtopicsonTwitter,FacebookJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,SensorNetworksManysensorsfeedingintoacentralcontrollerTelephonecallrecordsDatafeedsintocustomerbillsaswellassettlementsbetweentelephonecompaniesIPpacketsmonitoredataswitchGatherinformationforoptimalroutingDetectdenial-of-serviceattacksJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,Sincewecannotstoretheentirestream,oneobviousapproachistostoreasampleTwodifferentproblems:(1)Sampleafixedproportionofelementsinthestream(say1in10)(2)MaintainarandomsampleoffixedsizeoverapotentiallyinfinitestreamAtany“time”kwewouldlikearandomsampleofselementsWhatisthepropertyofthesamplewewanttomaintain?Foralltimestepsk,eachofkelementsseensofarhasequalprob.ofbeingsampledJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,Problem1:SamplingfixedproportionScenario:SearchenginequerystreamStreamoftuples:(user,query,time)Answerquestionssuchas:HowoftendidauserrunthesamequeryinasingledaysHavespacetostore1/10thofquerystreamNaïvesolution:Generatearandomintegerin[0..9]foreachqueryStorethequeryiftheintegeris0,otherwisediscardJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,Simplequestion:Whatfractionofqueriesbyanaveragesearchengineuserareduplicates?Supposeeachuserissuesxqueriesonceanddqueriestwice(totalofx+2dqueries)Correctanswer:d/(x+d)Proposedsolution:Wekeep10%ofthequeriesSamplewillcontainx/10ofthesingletonqueriesand2d/10oftheduplicatequeriesatleastonceButonlyd/100pairsofduplicatesd/100=1/10∙1/10∙dOfd“duplicates”18d/100appearexactlyonce18d/100=((1/10∙9/10)+(9/10∙1/10))∙dSothesample-basedansweris𝑑100𝑥10+𝑑100+18𝑑100=𝒅𝟏𝟎𝒙+𝟏𝟗𝒅J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,Pick1/10thofusersandtakealltheirsearchesinthesampleUseahashfunctionthathashestheusernameoruseriduniformlyinto10buc
本文标题:ch04-数据流挖掘1
链接地址:https://www.777doc.com/doc-2904665 .html