CS345A Data Mining
Lecture 1: Introduction to Web Mining

What is Web Mining?
- Discovering useful information from the World-Wide Web and its usage patterns

Web Mining v. Data Mining
- Structure (or lack of it)
  - Textual information and linkage structure
- Scale
  - Data generated per day is comparable to largest conventional data warehouses
- Speed
  - Often need to react to evolving usage patterns in real-time (e.g., merchandising)

Web Mining topics
- Web graph analysis
- Power Laws and The Long Tail
- Structured data extraction
- Web advertising
- Systems Issues

Size of the Web
- Number of pages
  - Technically, infinite
  - Much duplication (30-40%)
  - Best estimate of “unique” static HTML pages comes from search engine claims
    - Until last year, Google claimed 8 billion (?), Yahoo claimed 20 billion
    - Google recently announced that their index contains 1 trillion pages
    - How to explain the discrepancy?

The web as a graph
- Pages = nodes, hyperlinks = edges
  - Ignore content
  - Directed graph
- High linkage
  - 10-20 links/page on average
  - Power-law degree distribution

Structure of Web graph
- Let’s take a closer look at structure
  - Broder et al (2000) studied a crawl of 200M pages and other smaller crawls
  - Bow-tie structure
    - Not a “small world”

Bow-tie Structure
[Figure: bow-tie structure of the web graph. Source: Broder et al, 2000]

What can the graph tell us?
- Distinguish “important” pages from unimportant ones
  - Pagerank
- Discover communities of related pages
  - Hubs and Authorities
- Detect web spam
  - Trustrank

Power-law degree distribution
[Figure: power-law degree distribution. Source: Broder et al, 2000]

Power-laws galore
- Structure
  - In-degrees
  - Out-degrees
  - Number of pages per site
- Usage patterns
  - Number of visitors
- Popularity
  - e.g., products, movies, music

The Long Tail
[Figure: the long tail. Source: Chris Anderson (2004)]

The Long Tail
- Shelf space is a scarce commodity for traditional retailers
  - Also: TV networks, movie theaters, …
- The web enables near-zero-cost dissemination of information about products
- More choice necessitates better filters
  - Recommendation engines (e.g., Amazon)
  - How Into Thin Air made Touching the Void a bestseller

Extracting Structured Data
[Figure: extracting structured data]

Ads vs. search results
[Figure: ads vs. search results]

Ads vs. search results
- Search advertising is the revenue model
  - Multi-billion-dollar industry
  - Advertisers pay for clicks on their ads
- Interesting problems
  - What ads to show for a search?
  - If I’m an advertiser, which search terms should I bid on and how much to bid?

Two Approaches to Analyzing Data
- Machine Learning approach
  - Emphasizes sophisticated algorithms, e.g., Support Vector Machines
  - Data sets tend to be small, fit in memory
- Data Mining approach
  - Emphasizes big data sets (e.g., in the terabytes)
  - Data cannot even fit on a single disk!
  - Necessarily leads to simpler algorithms

Philosophy
- In many cases, adding more data leads to better results than improving algorithms
  - Netflix
  - Google search
  - Google ads
- More on my blog: Datawocky (datawocky.com)

Systems architecture
[Figure: a single node with CPU, Memory, and Disk; labels: Machine Learning, Statistics; “Classical” Data Mining]

Very Large-Scale Data Mining
[Figure: several nodes, each with CPU, Mem, and Disk; label: Cluster of commodity nodes]

Systems Issues
- Web data sets can be very large
  - Tens to hundreds of terabytes
- Cannot mine on a single server!
  - Need large farms of servers
- How to organize hardware/software to mine multi-terabyte data sets
  - Without breaking the bank!

Project
- Lots of interesting project ideas
  - If you can’t think of one please come discuss with us
- Infrastructure
  - Aster Data cluster on Amazon EC2
  - Supports both MapReduce and SQL
- Data
  - Netflix
  - ShareThis
  - Google
  - WebBase
  - TREC
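
Example: in-degree distribution in map/reduce style

The slides treat the web as a directed graph with a power-law degree distribution and mention that the project cluster supports MapReduce. The following is a minimal single-machine sketch in Python of how the in-degree distribution could be computed as two map/reduce passes. The function names (map_edges, reduce_counts, in_degree_histogram) and the toy edge list are hypothetical and are not taken from the lecture; a real job would run on a cluster framework rather than in one process.

```python
from collections import defaultdict
from itertools import chain


def map_edges(edge):
    """Map step: for one hyperlink (source, target), emit (target, 1)."""
    _source, target = edge
    yield (target, 1)


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def in_degree_histogram(edges):
    """Return (page -> in-degree, in-degree -> number of pages)."""
    # Pass 1: count incoming links per page.
    in_degree = reduce_counts(chain.from_iterable(map_edges(e) for e in edges))
    # Pass 2: count how many pages share each in-degree value.
    # Pages with no incoming links never appear as a key, so degree 0 is omitted.
    histogram = reduce_counts((degree, 1) for degree in in_degree.values())
    return in_degree, histogram


# Toy hyperlink graph with hypothetical page names: (source, target) pairs.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a"), ("b", "c")]
in_degree, histogram = in_degree_histogram(edges)
print(in_degree)
print(histogram)
```

On a real crawl, plotting the histogram on log-log axes is the usual way to check whether the in-degrees look roughly like a power law; on a toy graph this small the shape is not meaningful.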