您好,欢迎访问三七文档
当前位置:首页 > 商业/管理/HR > 管理学资料 > datastreammethods
DataStreamMethodsGrahamCormodegraham@dimacs.rutgers.eduS.Muthukrishnanmuthu@cs.rutgers.eduPlanofattackFrequentItems/HeavyHittersCountingDistinctElementsClusteringitemsinStreamsWhatarewegoingtodotoday?Frequentitems–Classicalalgorithms–Lowerbounds–Newapproachesforthisproblemanditsextensions.SomequestionsWeseealargenumberofindividualtransactions(suchasAmazonbooksales)–Whatarethetopsellerstoday?Wearemonitoringnetworktraffic–Whichhosts/subnetsareresponsibleformostofthetraffic?(datacollectionmaybedistributed)Wehaveanetworkofsatellitesmonitoringeventsoverlargeareas–Whichareasareexperiencingthemostactivityoveraweek/day/hour?TheproblemWewillimaginethatweareobservingastreamofdataThisstreamconsistsofasequenceofintegersintherange1...Uforsomeupperbound,UWewillbeinterestedinwhichintegersoccurmostfrequentlywithinthestreamExample:1,5,8,9,10,3,23542341234,15,289,31516,23,8,3571,1251,8,5,124,289,17,15,23542341234,126,1251,5,…Whocares?Doesanyonereallycareaboutwhichintegersoccurfrequently?Well,doesanyonereallycareaboutsortingintegers?NO!But…PeoplecareaboutsortingrecordsindexedbyintegersPeoplecareaboutsortingrecordswithkeysthatcanbecompared(names,SS-ids.Phone#s,etc.)Willstudytheproblemwithintegers,butreallyweareinterestedinIPaddresses,booktitles,geography,whateverWhatifwedon’thaveintegers?Wecanusuallymapitemstointegers–(732)445-4580!7,324,454,580–128.6.75.111!111+256(75+256(6+256*128)))=2147896175(32bits)–“Graham”!6bytes(48bits)Whataboutlongnames,like“Shanmugavelayutham”?Oraddresses,words,images?HashfunctionsWecanfindsomefunctionthatmapslargeitemstointegersThesemaybeexact(one-to-one),ortheymaymapmultipleitemstothesamenumberWewouldlikethatthechancesofthisaresmall–Example1:takejustfirstfewcharactersofname–Example2:adduptheASCIIvalueofcharacters–Example3:useMD5checksumWewillencounterthiskindoffunctionlater,inmoredetail…BacktoFrequentItemsWewanttofindwhichitemsoccurmostfrequently,so…Let’ssetupanarray,initializedtozeroAndaddonetoentryiwhenweseeiinthestreamThenfindwhichitemshavethehighestcountOK,we’redone,let’sgohome...HoldonaminuteThiswon’twork...IfwearetrackingIPaddresses,thenourarrayneedstohave232entries=4GbIfwearelookingatpairsofIPaddresses,thenweneed16,384Pb(1Pb=220Gb)Evenifwecansparethememory,scanningthearraytofindthefrequentitemswilltakealongtime.EgScanningAmazon’stransactionsforadaytofindpopularitemsMaybeimportanttoanswerthesequestionsquickly,withsmallspace,evenifnotapurestreamscenarioTimelineofCSWewilldescribeaclassicalalgorithmforthisproblem2000-2003Recent1995-2000LastCentury1990-1995CivilWarEra1980-1990MiddleAges1960-1980ClassicalTimesBefore1960Prehistorical?196019801990199520002003MJRTYBoyer,Moore,1982&Fischer,Salzberg1982FindaMajorityItem,ifoneexists–InitializeCounterto0–ForeachiteminthestreamIfthecounteriszero,puttheiteminthebucket,setcounterto1Else,ifthenextitem=iteminbucket,add1tocountElsesubtract1fromcount3Count:2ExamplesofMJRTYAMajorityitemisonethatoccursmorethanhalfthetimeStream1122223113113113111131Bucket1111222211111111111111Counter1210121010121232345656Whataboutthefollowingstream?12341234123412312...Thereisnomajority,butthemethodwillreturnsomeitemasacandidateTheMajorityisalwaysrightIfthereisnoMajorityelement,thenthealgorithmwillreturnsomeitemwithnoguaranteesaboutitsfrequency.Puttheotherway,thealgorithmwillfindanyitemwhichoccursmorethanhalfofthetime.Howcanwebesure?WewillprovethepropertiesofthealgorithmProvingthepropertiesImaginepairingupitems:everynon-majorityitemispairedupwithanon-majorityitem.ThentherewillbesomemajorityitemsleftoverEachpaircancelsout–weaddandsubtracttothecounterOverall,weendmustendupwithapositivecountforthemajorityitemnomatterhowwearrangetheitemsGeneralizingthemethodThemajorityguaranteestofindanyitemthatoccursmorethan½ofthetimeWegeneralizeittofindanyitemthatoccurswithfrequencygreaterthan1/(k+1)byusingkcountersandkbuckets:FrequentForeachiteminthestream–Iftheitemisinoneofthebuckets,increaseitscountbyone–Else,ifthereisabucketwithcountzero,putitinthatbucketandsetitscountto1–Else,reducethecountsofeverybucketby1ProvingtheclaimsClaim:iftherearekitemswhichoccurwithfrequencygreaterthan1/(k+1),thenwewillfindthemallHowlikelyisitthatthiswilloccurfortypicaldatasets?Cantherebedatasetswithmorethankitemswithfrequencygreaterthan1/(k+1)?Strongerclaim:everyitemwithfrequencygreaterthan1/(k+1)willbeincludedinabucketHW:Proveboththeseclaims.Howlongdoesittaketoprocesseachiteminthestream?Howlongdoesittaketoreturnthecandidatefrequentitems?RecentWorkTherehasbeenalotofactivityonthisprobleminthelastyear:–MankuandMotwani,VLDB2002–Charikar,Chen,Farach-Colton,ICALP2002–Demaine,Lopez-Ortiz,Munro,ESA2002–Karp,Papadimitriou,Shenker,ACMTODS2003–C,Muthukrishnan,PODS2003–Severalothersinprogress...Papersonfrequentitemsareafrequentitem!ExtensionstotheQuestionTherearemanyvariationsoftheproblemtoconsider:Whatifyouseeonestream,andIseeanotherstream.Howcanwecombineourobservationstofinditemsthatarefrequentintheunionofthestreams?Whatifitemsarriveanddepart?Whatifwewanttoknow(sub)setsthatoccurredtogether
本文标题:datastreammethods
链接地址:https://www.777doc.com/doc-4989 .html