The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page
{sergey, page}@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305

Abstract

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at:

Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google

1. Introduction

(Note: There are two versions of this paper -- a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)

The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! or with search engines. Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines.

We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines.

1.1 Web Search Engines -- Scaling Up: 1994-2000

Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm () [McBryan 94] had an index of 110,000 web pages and web accessible documents. As of November, 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents.

At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.

1.2. Google: Scaling with the Web

Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access (see section 4.2). Further, we expect that the cost to index and store text or HTML will eventually decline relative to the amount that will be available (see Appendix B). This will result in favorable scaling properties for centralized systems like Google.

1.3 Design Goals

1.3.1 Improved Search Quality

Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 -- Navigators, The best navigation service should make it easy to find almos