Hsin-Hsi Chen 4-1

Chapter 4 Query Language
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University

Introduction
• Goals
  – Which queries can be formulated
  – How the formulation is related to underlying information retrieval models
• Query languages

[Figure: types of queries. Labels: basic queries (keywords and context: words, phrases, proximity; pattern matching: errors, substrings, prefixes, suffixes, regular expressions, extended patterns), Boolean queries, fuzzy Boolean, natural language, structured queries.]

Keyword-Based Querying
• single-word queries
  – A query is formulated by a word
  – A document is a long sequence of words
  – A word is a sequence of letters surrounded by separators
  – What are letters and separators?
    • e.g., ‘on-line’
  – Chinese sentences are composed of characters without word boundaries
  – The division of the text into words is not arbitrary (This topic will be dealt with in a special talk for Chinese IR)

The Word Segmentation Problem (斷詞問題)
• Problem
  – Chinese sentences have no explicit delimiters between words.
  – 這名記者會說國語。
    • 這 名 記者 會 說 國語。 (This reporter can speak Mandarin.)
    • 這 名 記者會 說 國語。 (記者會, "press conference", is taken as one word.)
• Definition of a word
  – A string that has an independent meaning and plays a specific grammatical role should be treated as a word.
• Segmentation standards
  – Mainland China: 信息處理用現代漢語分詞規範 (Contemporary Chinese Word Segmentation Specification for Information Processing)
    • Drafted in 1989
    • Submitted as a national standard in 1993

The Word Segmentation Problem (Continued)
  – Taiwan: 資訊處理用中文分詞標準草案 (Draft Standard for Chinese Word Segmentation for Information Processing)
    • Drafted in 1996 by the Computational Linguistics Society of the Republic of China (中華民國計算語言學學會)
    • Basic principles
      – A string whose meaning cannot be obtained by directly adding up the meanings of its components should form one segmentation unit. Example: 撞期 (a scheduling conflict) vs 撞山 (to crash into a mountain)
      – A string whose part of speech cannot be derived directly from its components should be merged into one segmentation unit. Example: 好喝 (tasty, of a drink)

Processing Model
• A dictionary is an indispensable resource
  – List "all" possible words
    • 把他的確實行動作了分析
      把, 他, 的, 確實, 實行, 行動, 動作, 了, 分析
    • 電子計算機是會計算題目的機器
      電子, 計算, 計算機, 電子計算機, 是, 會, 會計, 計算, 計算題, 題目, 目的, 的, 機器
  – word lattice

[Figure: word lattice for 電子計算機是會計算題目的機器.]

Processing Model (Continued)
• Disambiguation mechanism
  – Pick the best combination
  – Strategies
    • Rule-based
      – Longest word first: 台灣大學是有名的學府
        Long words mask short words: 這名記者會說國語。
      – Remove word segments that break the path
      – Heuristics: prefer three-character words, ...
      – Parser
    • Statistical
      – Markov models, relaxation, ...
  – Performance: every group claims an accuracy above 95%

Processing Model (Continued)
• The underlying problem
  – Does the dictionary contain all possible words?
    • A-錢, 凍蒜
  – Strategies
    • Morphological rules
    • (Semi-)automatic construction of new dictionaries
    • Unknown-word handling

Morphological Rules
• Formation of numerals and measure words
  – 一個個, 一條條
• Dates and times
  – 八十五年十月四日
• Prefixes or suffixes of nouns or verbs
  – 學生們
• Special verbs
  – 丟丟看, 吃吃看, 寫寫看
  – 高高興興, 歡歡喜喜, 漂漂亮亮, 迷迷糊糊
  – 打打球, 跑跑步, 寫寫字
• ...

Context Queries
• definition
  – Search words in a given context, e.g., near other words
• types
  – phrase
    • a sequence of single-word queries
    • e.g., enhance retrieval
  – proximity
    • a sequence of single words or phrases, together with a maximum allowed distance between them
    • e.g., within distance (enhance, retrieval, 4) will match ‘…enhance the power of retrieval…’

Boolean Queries
• definition
  – A syntax composed of atoms that retrieve documents, and of Boolean operators which work on their operands
  – e.g., translation AND (syntax OR syntactic)

[Figure: query syntax tree. AND is the root, with operands translation and OR; OR has operands syntax and syntactic.]

Boolean Queries (Continued)
• operators
  – (e1 OR e2)
    • Select all documents which satisfy e1 or e2. Duplicates are eliminated.
  – (e1 AND e2)
    • Select all documents which satisfy both e1 and e2.
  – (e1 BUT e2)
    • Select all documents which satisfy e1 but not e2.
• “fuzzy Boolean”
  – Retrieve documents appearing in some operands (the AND may require a document to appear in more operands than the OR)

Natural Language
• generalization of “fuzzy Boolean”
• A query is an enumeration of words and context queries.
• All the documents matching a portion of the user query are retrieved.

Pattern Matching
• A pattern is a set of syntactic features that must occur in a text segment
• types
  – words
  – prefixes, e.g., ‘comput’ matches ‘computer’, ‘computation’, ‘computing’, etc.
  – suffixes, e.g., ‘ters’ matches ‘computers’, ‘testers’, ‘painters’, etc.
  – substrings, e.g., ‘tal’ matches ‘coastal’, ‘talk’, ‘metallic’, etc.
  – ranges (lexicographic order), e.g., between ‘held’ and ‘hold’ matches ‘hoax’ and ‘hissing’

Pattern Matching (Continued)
  – allowing errors
    • Retrieve all text words which are ‘similar’ to the given word
    • edit distance: the minimum number of character insertions, deletions, and replacements needed to make two strings equal, e.g., ‘flower’ and ‘flo wer’
    • maximum allowed edit distance: the query specifies the maximum number of allowed errors for a word to match the pattern

Pattern Matching (Continued)
  – regular expressions
    • union: if e1 and e2 are regular expressions, then (e1|e2) matches what e1 or e2 matches
    • concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are formed by the occurrences of e1 immediately followed by those of e2
    • repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e
    • ‘pro(blem|tein)(s|)(0|1|2)*’ matches ‘problem2’ and ‘proteins’

Pattern Matching (Continued)
  – extended patterns
    • subsets of the regular expressions expressed with a simpler syntax
    • classes of characters
    • conditional expressions
    • wild characters which match any sequence in the text
    • combinations

Structural Queries
• mixing contents and structure in queries
  – contents: words, phrases, or patterns
  – structural constraints: containment, proximity, or other restrictions on structural elements
• issues
  – what structure a text may have
  – what queries can be made on which structures
• three main structures
  – form-like fixed structure
  – hypertext structure
  – hierarchical structure

Form-like fixed structure
Document: a fixed set of fields
For example, a mail has a sender, a receiver, a date, a subject and a body field.
Search for the mails sent to a given person with “football” in the Subject field

[Figure: a document drawn as a set of fields, each holding text.]

Hypertext structure
A hypertext is a directed graph where
  – nodes hold some text (text contents)
  – the links represent connections between nodes or between positions inside nodes (structural connectivity)
WebGlimpse: combine browsing and searching on the Web

WebGlimpse (
• WebGlimpse is a fast, flexible search engine for finding information in a related web of pages.
• The ability to index pages on remote sites provides a level of power one step
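
The dictionary-based processing model from the word segmentation slides can be illustrated with a small sketch. It first lists every dictionary word found in a sentence (the raw material of a word lattice) and then applies the longest-word-first rule. The function names and the toy dictionaries are assumptions made for illustration; the slides do not prescribe an implementation.

```python
def candidate_words(text, dictionary):
    """List every dictionary word occurring in text, by start position, then length."""
    max_len = max(len(word) for word in dictionary)
    found = []
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            piece = text[start:start + length]
            # Skip slices cut short by the end of the text.
            if len(piece) == length and piece in dictionary:
                found.append(piece)
    return found


def longest_first(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    words = []
    pos = 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            # A single character is always accepted, so unknown characters
            # never stall the scan.
            if length == 1 or piece in dictionary:
                words.append(piece)
                pos += length
                break
    return words


# Toy dictionaries (hypothetical), built from the words shown on the slides.
dict_a = {"把", "他", "的", "確實", "實行", "行動", "動作", "了", "分析"}
dict_b = {"這", "名", "記者", "記者會", "會", "說", "國語"}

print(candidate_words("把他的確實行動作了分析", dict_a))
# ['把', '他', '的', '確實', '實行', '行動', '動作', '了', '分析']
print(longest_first("這名記者會說國語", dict_b))
# ['這', '名', '記者會', '說', '國語']
```

The second call shows the weakness named on the slide: the long word 記者會 masks the shorter 記者 + 會, so a purely greedy rule picks the "press conference" reading.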
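
The proximity query from the Context Queries slide can be sketched as follows. This is a minimal illustration under two assumptions that the slide does not state: distance is the difference between word positions, and the first word must come before the second. The name `within_distance` is hypothetical.

```python
import re


def within_distance(text, first, second, max_distance):
    """Return True if `second` occurs at most max_distance words after `first`."""
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    # Assumption: distance = difference in word positions, first word earlier.
    return any(
        0 < j - i <= max_distance
        for i in first_positions
        for j in second_positions
    )


sample = "... enhance the power of retrieval ..."
print(within_distance(sample, "enhance", "retrieval", 4))  # True
print(within_distance(sample, "enhance", "retrieval", 3))  # False
```

With a maximum distance of 4 the slide's example text matches, because ‘retrieval’ is the fourth word after ‘enhance’; tightening the limit to 3 rejects it.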
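
The edit distance defined on the Pattern Matching slides (insertions, deletions, and replacements) can be computed with the standard dynamic-programming recurrence. The sketch below is one common way to do it, not necessarily the method used in the lecture; `matches_with_errors` is a hypothetical helper showing how a maximum allowed edit distance would filter words.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


def matches_with_errors(word, pattern, max_errors):
    """True if word is within max_errors edits of pattern."""
    return edit_distance(word, pattern) <= max_errors


print(edit_distance("color", "colour"))    # 1
print(edit_distance("survey", "surgery"))  # 2
print(matches_with_errors("colour", "color", 1))  # True
```

Keeping only two rows of the table keeps memory proportional to the length of the second string, which is enough when only the distance, not the edit script, is needed.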
Document title: The Word Segmentation Problem (断词问题).
Link: https://www.777doc.com/doc-3243424 .html