Search Engine Technology
Yan Hongfei, yhf@net.pku.edu.cn
Network Laboratory, Department of Computer Science, Peking University
December 24, 2004 @ CERNET 2004

Outline
• How search engines work
• Information retrieval: related research and institutions

Web Search Engines
• Definition: a system that lets users submit a query, retrieves the web pages relevant to that query, and returns them as a ranked result list.
• Ways to build the index
  – Manual indexing
  – Automatic indexing
• System architecture
  – Centralized architecture
  – Distributed architecture

[Figure labels: Browsing Services; Search Engine Services; Web Pages; Bag of Words; Two semantics extremes; Two service extremes]

The Three-Stage Workflow of a Search Engine
• Crawling
  – Batch crawling, incremental crawling; crawl targets, crawl strategies
• Preprocessing
  – Keyword extraction; duplicate page elimination; link analysis; indexing
• Serving
  – Query modes and matching; result ranking; document summaries

[Figure: crawl, organize, serve]
[Figure: search engine system flow]
[Figure: Tianwang search engine system flow]
[Figure: distributed Web crawling system architecture, showing three coordinator processes (nodes), each with a crawler process, and a scheduling module]

Tianwang Storage Format
  version: 1.0                        // version number
  url: …
  Tue, 15 Apr 2003 08:13:06 GMT       // time of harvest
  ip: 162.105.129.12                  // IP address
  unzip-length: 30233                 // if included, the data must be compressed
  length: 18133                       // data length
                                      // a blank line
  XXXXXXXX                            // the following is the data part
  XXXXXXXX
  ….
  XXXXXXXX                            // data end
                                      // insert a new line

Indexes
• What should the index contain?
• Database systems index primary and secondary keys
  – This is the hybrid approach
  – Index provides fast access to a subset of database records
  – Scan subset to find solution set
• IR problem: cannot predict the keys that people will use in queries
  – Every word in a document is a potential search term
• IR solution: index by all keys (words), i.e. full-text indexes

Index Contents
• The contents depend upon the retrieval model
• Feature presence/absence
  – Boolean
  – Statistical (tf, df, ctf, doclen, maxtf)
  – Often about 10% the size of the raw data, compressed
• Positional
  – Feature location within document
  – Granularities include word, sentence, paragraph, etc.
  – Coarse granularities are less precise, but take less space
  – Word-level granularity is about 20-30% the size of the raw data, compressed

Indexes: Implementation
• Common implementations of indexes
  – Bitmaps
  – Signature files
  – Inverted files
• Common index components
  – Dictionary (lexicon)
  – Postings
    • document ids
    • word positions

[Figure: Inverted Files, no positional data indexed]
[Figure: Inverted Files]
[Figure: Word-Level Inverted File]

Inverted Search Algorithm
1. Find query elements (terms) in the lexicon
2. Retrieve postings for each lexicon entry
3. Manipulate postings according to the retrieval model

Word-Level Inverted File
Query:
1. porridge & pot (BOOL)
2. "porridge pot" (BOOL)
3. porridge pot (VSM)
[Figure labels: lexicon; posting; Answer]

A Brief History of Modern Information Retrieval
• In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly.
• In the 1960s, the SMART system was developed by Gerard Salton and his students.
• The Cranfield evaluations were carried out by Cyril Cleverdon.
• The 1970s and 1980s saw many developments built on the advances of the 1960s.
• 1992 saw the inception of the Text Retrieval Conference (TREC).
• The algorithms developed in IR were employed for searching the Web from 1996.

Clustering of SIGIR papers by topic vs. year
Year columns: 71, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 00, 01, 02, Total

  Cluster                                   Total
  Databases, NL Interfaces                     66
  General!                                    126
  Models                                       30
  Question answering                           17
  Syntactic phrases & SDR                      37
  Conceptual IR, KB IR                         75
  Compression                                  18
  Clustering                                   26
  Relevance feedback                           25
  Inverted files & Implementations             18
  Term weighting                               31
  Message understanding & TDT                  31
  Filtering                                    18
  Hypertext IR, Multiple evidence              33
  Image retrieval                               9
  Probabilistic & Language models              34
  Boolean & extended Boolean                   10
  Japanese & Chinese IR                        14
  DBMS & IR                                     5
  Users & Search                               38
  Visualisation                                12
  Signature files                               9
  Distributed IR                               24
  Evaluation                                   34
  Topic distillation & Linkage retrieval        9
  Latent semantic indexing                      6
  Text categorisation                          23
  Document summarisation                       12
  Cross lingual                                16
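As an illustration of the Tianwang storage format shown earlier (header fields such as version, ip, unzip-length and length, then a blank line, then the data part followed by a newline), a record reader might look roughly like the sketch below. The function name read_record is hypothetical, and the use of zlib for the compressed data part is an assumption; the slides only say that the data must be compressed when unzip-length is present.

```python
import io
import zlib


def read_record(stream):
    """Read one record: header lines, a blank line, then `length` bytes of data.

    Returns (headers, data), or None when the stream is exhausted.
    """
    headers = {}
    for raw in iter(stream.readline, b""):
        line = raw.decode("ascii", "replace").strip()
        if not line:            # the blank line ends the header block
            break
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    if not headers:
        return None
    data = stream.read(int(headers["length"]))
    stream.readline()           # skip the newline written after the data part
    if "unzip-length" in headers:
        # Assumption: the compressed data part is a zlib stream.
        data = zlib.decompress(data)
    return headers, data


# Usage with a made-up record (the page body here is illustrative only).
payload = zlib.compress(b"<html>hello</html>")
record = (b"version: 1.0\nip: 162.105.129.12\nunzip-length: 18\nlength: %d\n\n"
          % len(payload)) + payload + b"\n"
stream = io.BytesIO(record)
headers, data = read_record(stream)
```

Reading by the length header rather than scanning for a terminator keeps the reader safe when the data part itself contains blank lines.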
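The three steps of the inverted search algorithm (find the terms in the lexicon, retrieve their postings, manipulate the postings according to the retrieval model) can be sketched with a small word-level inverted file. This is a minimal sketch, not the Tianwang implementation: the function names and the three sample documents are hypothetical, and only the two Boolean query styles from the slides (conjunction and phrase) are covered, not the VSM query.

```python
from collections import defaultdict


def build_index(docs):
    """Build a word-level inverted file: term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Documents containing every term: intersect the posting lists."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


def phrase(index, terms):
    """Documents in which the terms occur at consecutive word positions."""
    hits = []
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits


# Hypothetical sample collection.
docs = {
    1: "pease porridge hot",
    2: "pease porridge in the pot",
    3: "the porridge pot is hot",
}
index = build_index(docs)
and_result = boolean_and(index, ["porridge", "pot"])   # porridge & pot
phrase_result = phrase(index, ["porridge", "pot"])     # "porridge pot"
```

The phrase query is only possible because word positions are stored in the postings; an index holding document ids alone could answer the conjunction but not the phrase.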