您好,欢迎访问三七文档
当前位置:首页 > 行业资料 > 其它行业文档 > Google云计算解决方案3_元数据应用
GoogleClusterComputingFacultyTrainingWorkshopModuleIII:NutchMeta-detailsBuilttoencouragepublicsearchworkOpen-source,w/pluggablemodulesCheaptorun,bothmachines&adminsGoal:Searchmorepages,withbetterquality,thananyotherenginePrettygoodrankingHasdone~200Mpages,morepossibleHadoopisaspinoffOutlineNutchdesignLinkdatabase,fetcher,indexer,etc…HadoopsupportDistributedfilesystem,jobcontrolWebDBFetcher2ofNFetcher1ofNFetcher0ofNFetchlist2ofNFetchlist1ofNFetchlist0ofNUpdate2ofNUpdate1ofNUpdate0ofNContent0ofNContent0ofNContent0ofNIndexer2ofNIndexer1ofNIndexer0ofNSearcher2ofNSearcher1ofNSearcher0ofNWebServer2ofMWebServer1ofMWebServer0ofMIndex2ofNIndex1ofNIndex0ofNInjectMovingPartsAcquisitioncycleWebDBFetcherIndexgenerationIndexingLinkanalysis(maybe)ServingresultsWebDBContainsinfoonallpages,linksURL,lastdownload,#failures,linkscore,contenthash,refcountingSourcehash,targetURLMustalwaysbeconsistentDesignedtominimizediskseeks19msseektimex200mnewpages/mo=~44daysofdiskseeks!Single-diskWebDBwashugeheadacheFetcherFetcherisverystupid.Nota“crawler”Pre-MapRed:divide“to-fetchlist”intokpieces,oneforeachfetchermachineURLsforonedomaingotosamelist,otherwiserandom“Politeness”w/ointer-fetcherprotocolsCanobserverobots.txtsimilarlyBetterDNS,robotscachingEasyparallelismTwooutputs:pages,WebDBedits2.Sortedits(externally,ifnecessary)WebDB/FetcherUpdatesURL::NeverContentHash:NoneURL::NeverContentHash:NoneURL::4/07/05ContentHash:MD5_toewkekqmekkalekaaURL::3/22/05ContentHash:MD5_sdflkjweroiwelksdEdit:DOWNLOAD_CONTENTURL::MD5_balboglerropewolefbagEdit:DOWNLOAD_CONTENTURL::MD5_toewkekqmekkalekaaEdit:NEW_LINKURL::NoneWebDBFetcheredits1.Writedownfetcheredits3.Readstreamsinparallel,emittingnewdatabase4.RepeatforothertablesURL::Today!ContentHash:MD5_balboglerropewolefbagURL::Today!ContentHash:MD5_toewkekqmekkalekaaIndexingIteratethroughallkpagesetsinparallel,constructinginvertedindexCreatesa“searchabledocument”of:URLtextContenttextIncominganchortextOthercontenttypesmighthaveadifferentdocumentfieldsEg,emailhassender/receiverAnysearchablefieldend-userwillwantUsesLucenetextindexerLinkanalysisApage’srelevancedependsonbothintrinsicandextrinsicfactorsIntrinsic:pagetitle,URL,textExtrinsic:anchortext,linkgraphPageRankismostfamousofmanyOthersinclude:HITSOPICSimpleincominglinkcountLinkanalysisissexy,butimportancegenerallyoverstatedLinkanalysis(2)NutchperformsanalysisinWebDBEmitascoreforeachknownpageAtindextime,incorporatescoreintoinvertedindexExtremelytime-consumingInourcase,disk-consuming,too(becausewewanttouselow-memorymachines)Fastandeasy:0.5*log(#incominglinks)“britney”QueryProcessingDocs0-1MDocs1-2MDocs2-3MDocs3-4MDocs4-5MDs2.3M,2.9M1.2M,4.4M,29,…AdministeringNutchAdmincostsarecriticalIt’sahasslewhenyouhave25machinesGooglehas100k,probablymoreFilesWebDBcontent,workingfilesFetchlists,fetchedpagesLinkanalysisoutputs,workingfilesInvertedindicesJobsEmitfetchlists,fetch,updateWebDBRunlinkanalysisBuildinvertedindicesAdministeringNutch(2)Adminsoundsboring,butit’snot!ReallyIswearLarge-filemaintenanceGoogleFileSystem(Ghemawat,Gobioff,Leung)NutchDistributedFileSystemJobControlMap/Reduce(DeanandGhemawat)Pig(YahooResearch)DataStorage(BigTable)NutchDistributedFileSystemSimilar,butnotidentical,toGFSRequirementsarefairlystrangeExtremelylargefilesMostfilesreadonce,fromstarttoendLowadmincostsperGBEquallystrangedesignWrite-once,withdeleteSinglefilecanexistacrossmanymachinesWhollyautomaticfailurerecoveryNDFS(2)DatadividedintoblocksBlockscanbecopied,replicatedDatanodesholdandserveblocksNamenodeholdsmetainfoFilenameblocklistBlockdatanode-locationDatanodesreportintonamenodeeveryfewsecondsNDFSFileReadNamenodeDatanode0Datanode1Datanode2Datanode3Datanode4Datanode51.Clientasksdatanodeforfilenameinfo2.Namenoderespondswithblocklist,andlocation(s)foreachblock3.Clientfetcheseachblock,insequence,fromadatanode“crawl.txt”(block-33/datanodes1,4)(block-95/datanodes0,2)(block-65/datanodes1,4,5)NDFSReplicationNamenodeDatanode0(33,95)Datanode1(46,95)Datanode2(33,104)Datanode3(21,33,46)Datanode4(90)Datanode5(21,90,104)1.Alwayskeepatleastkcopiesofeachblk2.Imaginedatanode4dies;blk90lost3.Namenodelosesheartbeat,decrementsblk90’sreferencecount.Asksdatanode5toreplicateblk90todatanode04.ChoosingreplicationtargetistrickyMap/ReduceMap/ReduceisprogrammingmodelfromLisp(andotherplaces)EasytodistributeacrossnodesNiceretry/failuresemanticsmap(key,val)isrunoneachiteminsetemitskey/valpairsreduce(key,vals)isrunforeachuniquekeyemittedbymap()emitsfinaloutputManyproblemscanbephrasedthiswayMap/Reduce(2)Task:countwordsindocsInputconsistsof(url,contents)pa
本文标题:Google云计算解决方案3_元数据应用
链接地址:https://www.777doc.com/doc-6469698 .html