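South China University of Technology: "Information Retrieval and Web Mining" Review Notes

IR01.

【IR Task】
Given:
1) A corpus of textual natural-language documents.
2) A user query in the form of a textual string.
Find: A ranked set of documents that are relevant to the query.

【Term-document matrix】

【Term frequency vectors】

【TF-IDF】
TF: term frequency, the raw frequency of a term $k_i$ inside a document $d_j$.
IDF: inverse document frequency, the inverse of the frequency of a term $k_i$ among the documents in the collection.

【Query Vector】

【Cosine similarity】
The similarity between a document vector and the query vector is the cosine of the angle between them:
$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|}$$
Each factor in the denominator is the length of a vector:
$$|\vec{d_j}| = \sqrt{\sum_i w_{i,j}^2}$$
where $w_{i,j}$ is the weight of term $k_i$ in document $d_j$.

【Normalized TF-IDF】

【Retrieval Example】
Query: contaminated retrieval

【Advantages of VSM】
(1) It improves retrieval performance. For many documents, the vector model gives more accurate results than the Boolean model.
(2) Its partial-matching strategy brings the results closer to what the user needs.
(3) Documents can be ranked by their similarity to the query.

【Disadvantages of VSM】
(1) Semantic and syntactic information is lost.
(2) Terms are assumed to be mutually independent, whereas index terms are in fact correlated.
(3) Queries cannot be expressed with logical relations as they can in the Boolean model.

【Inverted Files】

【Postings File】
What a posting stores depends on the kind of retrieval:
(1) Boolean retrieval: just the document number.
(2) Ranked retrieval: document number and term weight (TF*IDF, ...).
(3) Proximity operators: word offsets for each occurrence of the term.

【Boolean Queries】
OR, AND, BUT (take the union, intersection, and difference of the postings lists, respectively).
Merging two postings lists of lengths x and y takes O(x+y) operations only if both are in the same order, so postings should be sorted by docID.

【Phrasal Search】
Phrase search algorithm:
(1) Use AND to find the set of documents that contain every word of the phrase.
(2) Initialize the result set to the empty set.
(3) For each document in that set:
    - Build a position vector for each word of the phrase.
    - Pick the word with the shortest position vector (the lowest TF) as the anchor.
    - For each occurrence of the anchor, check whether the other words of the phrase appear next to it at the expected offsets.
    - If they do, add the document to the result set.

【Automatic Evaluation Model】

【Set-Based Effectiveness Measures】
Precision: how much of what was found is relevant.
Recall: how much of what is relevant was found.

【Computing Recall/Precision Points】

【Interpolating a Recall/Precision Curve】
Interpolate at the 11 standard recall levels (0.0, 0.1, …, 1.0). Precision at the j-th level is the maximum known precision at any recall level at or above the j-th level:
$$P(r_j) = \max_{r \ge r_j} P(r)$$

【Mean Average Precision (MAP)】
Mean of the average precision over many queries. Two main types of averaging:
(1) Micro-average: average over all queries (each relevant document is a point in the average).
(2) Macro-average: average of within-query precision/recall (each query is a point in the average).

【R-Precision】
R = number of relevant documents for the query.
R-Precision = precision at position R of the ranked result.

【Precision@N】
Mean precision at a fixed number of retrieved documents. Precision@10 and Precision@20 are most often used for web search.

【Mean Reciprocal Rank, MRR】
Mean of the reciprocal ranks (the reciprocal of the rank at which the correct answer appears) over all topics. For example, if the correct answer is ranked 2nd for query 1 and 4th for query 2, then MRR = (1/2 + 1/4) ÷ 2 = 3/8.

【Discounted Cumulative Gain, DCG】
Uses graded relevance as a measure of the usefulness, or gain, from examining a document. DCG is the total gain accumulated at a particular rank p:
$$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$$
or
$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(1 + i)}$$

【Normalized DCG, NDCG】
DCG values are normalized by comparing them with the DCG of the perfect ranking (the ideal DCG) at the same rank:
$$NDCG_p = \frac{DCG_p}{IDCG_p}$$
that is, actual DCG ÷ ideal DCG.

【Evaluation Benchmarks】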
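
To make the TF-IDF and cosine similarity notes in IR01 concrete, here is a minimal Python sketch of vector-space ranking. It assumes the common IDF variant log(N / n_i), which the notes do not spell out, and the function names (`tf_idf_vectors`, `cosine`, `rank`) and the toy corpus are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / n_i); other variants exist.
    idf = {term: math.log(n_docs / n_i) for term, n_i in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency inside this document
        vectors.append({term: tf[term] * idf[term] for term in tf})
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    len_u = math.sqrt(sum(w * w for w in u.values()))
    len_v = math.sqrt(sum(w * w for w in v.values()))
    if len_u == 0.0 or len_v == 0.0:
        return 0.0
    return dot / (len_u * len_v)


def rank(query, docs):
    """Return (score, doc_index) pairs sorted by decreasing similarity."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the corpus carry no weight.
    q_tf = Counter(term for term in query if term in idf)
    q_vec = {term: q_tf[term] * idf[term] for term in q_tf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Toy corpus (hypothetical), queried with the notes' example query.
docs = [
    ["contaminated", "water", "retrieval"],
    ["water", "supply"],
    ["retrieval", "of", "information"],
]
ranking = rank(["contaminated", "retrieval"], docs)
```

Because the score is a real number rather than a yes/no match, a document that contains only one of the two query terms still receives a nonzero score, which is the partial matching that the notes list as an advantage of the vector space model.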
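
The Boolean AND merge and the phrase search steps from IR01 can be sketched in Python as follows. This is an illustrative sketch, not the course's reference implementation: the positional index layout (term -> {docID: list of word offsets}) and the names `build_positional_index`, `intersect`, and `phrase_search` are assumptions made for the example.

```python
def build_positional_index(docs):
    """Map each term to {doc_id: [word offsets]} for tokenized documents."""
    index = {}
    for doc_id, tokens in enumerate(docs):
        for pos, term in enumerate(tokens):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def intersect(p1, p2):
    """AND: merge two docID-sorted postings lists in O(x + y) steps."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return out


def phrase_search(index, phrase):
    """Return the docIDs in which the words of `phrase` occur consecutively."""
    if not phrase or any(term not in index for term in phrase):
        return []
    # Step 1: AND over the postings to get candidate documents.
    candidates = sorted(index[phrase[0]])
    for term in phrase[1:]:
        candidates = intersect(candidates, sorted(index[term]))
    results = []
    for doc in candidates:
        offsets = [set(index[term][doc]) for term in phrase]
        # Anchor: the phrase word with the fewest occurrences in this document.
        k = min(range(len(phrase)), key=lambda n: len(offsets[n]))
        for pos in index[phrase[k]][doc]:
            start = pos - k  # where the phrase would have to begin
            if all(start + n in offsets[n] for n in range(len(phrase))):
                results.append(doc)
                break
    return results


# Toy documents (hypothetical).
docs = [
    "information retrieval is fun".split(),
    "retrieval of information".split(),
    "web information retrieval and information retrieval".split(),
]
index = build_positional_index(docs)
hits = phrase_search(index, ["information", "retrieval"])
```

Anchoring on the rarest word keeps the number of offset checks small, and the two-pointer merge only works because both postings lists are sorted by docID.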
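
The evaluation measures in IR01 (recall/precision points, 11-point interpolation, average precision, MRR, DCG, and NDCG) can be sketched in a few lines of Python. The function names are hypothetical, and the DCG variant used here, with a log2 discount starting at rank 2, is an assumption; it is one common definition among several.

```python
import math


def precision_recall_points(ranking, relevant):
    """Return (recall, precision) at each rank where a relevant doc appears."""
    points, found = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points


def interpolate_11pt(points):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    levels = [j / 10 for j in range(11)]
    # Max known precision at any recall at or above the level
    # (small tolerance guards against floating-point rounding).
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant document's rank."""
    if not relevant:
        return 0.0
    points = precision_recall_points(ranking, relevant)
    return sum(p for _, p in points) / len(relevant)


def mean_reciprocal_rank(correct_ranks):
    """correct_ranks[i] is the rank of the correct answer for query i."""
    return sum(1 / r for r in correct_ranks) / len(correct_ranks)


def dcg(gains, p=None):
    """Assumed variant: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    gains = gains if p is None else gains[:p]
    return sum(
        g if i == 1 else g / math.log2(i)
        for i, g in enumerate(gains, start=1)
    )


def ndcg(gains, p=None):
    """Actual DCG divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True), p)
    return dcg(gains, p) / ideal if ideal > 0 else 0.0


# The MRR example from the notes: correct answers at ranks 2 and 4.
mrr = mean_reciprocal_rank([2, 4])
```

Averaging `average_precision` over a set of queries gives the per-query (macro-averaged) form of MAP described in the notes.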