1. Chapter 14: Semi-structured Text Mining
  Yang Jianwu
  Email: yangjw@pku.edu.cn
  Institute of Computer Science and Technology, Peking University
  Text Mining Techniques (Spring 2012)

2. Text-centric XML retrieval
  Documents marked up as XML
    E.g., assembly manuals, journal issues
  Queries are user information needs
    E.g., give me the Section (element) of the document that tells me how to change a brake light
  [Figure: a document hierarchy labelled Book, Chapters, Sections, Subsections, shown next to the World Wide Web; the body text inside the figure is filler.]

3. Conceptual model
  [Figure: retrieval pipeline. Box and arrow labels: Documents, Indexing, Document representation, Query, Formulation, Query representation, Retrieval function, Retrieval results, Relevance feedback.]
  Annotations for structured retrieval:
    Structured documents
    Content + structure
    Inverted file + structure index
    tf, idf, …
    Matching content + structure
    Presentation of related components

4. Approaches …
  Models: vector space model, probabilistic model, Bayesian network, language model, extending DB model, Boolean model, natural language processing, cognitive model, ontology, belief model, logistic regression, divergence from randomness, machine learning
  Techniques: parameter estimation, tuning, smoothing, fusion, phrase, term statistics, collection statistics, component statistics, proximity search, relevance feedback

5. Language model (University of Amsterdam, INEX 2003)
  [Formula lost in extraction. Surviving labels: element language model, collection language model, smoothing parameter, element score, element size, element score, article score, rank element.]
  Query expansion with blind feedback
  Ignore elements with [comparison symbol lost] 20 terms
  A high value of the smoothing parameter λ leads to an increase in the size of retrieved elements
  Results with λ = 0.9, 0.5 and 0.2 are similar

6. Vector space model (IBM Haifa, INEX 2003)
  [Figure: five separate indexes (article, abstract, section, sub-section, paragraph). Each index produces an RSV, each RSV is normalised, and the normalised RSVs are merged.]
  tf and idf as for fixed and non-nested retrieval units

7. Vector spaces and XML
  Vector spaces: a tried and tested framework for keyword retrieval
    Other "bag of words" applications in text: classification, clustering
  For text-centric XML retrieval, can we make use of vector space ideas?
  Challenge: capture the structure of an XML document in the vector space.

8. Vector spaces and XML
  For instance, distinguish between the following two cases
  [Figure: two document trees. One contains "Bill Gates" and "Microsoft"; the other contains "Bill Wulf" and "The Pearly Gates".]

9. Content-rich XML: representation
  [Figure: the same two trees with one leaf per lexicon term: Microsoft, Bill, Gates in one; The, Pearly, Gates, Bill, Wulf in the other.]

10. Encoding the Gates differently
  What are the axes of the vector space?
  In text retrieval, there would be a single axis for Gates
  Here we must separate out the two occurrences, under Author and Title
  Thus, axes must represent not only terms, but something about their position in an XML tree

11. Queries
  Before addressing this, let us consider the kinds of queries we want to handle
  [Figure: query trees over the terms Microsoft, Gates, Bill.]

12. Subtrees and structure
  Consider all subtrees of the document that include at least one lexicon term:
  [Figure: the subtrees of the example document over the terms Bill, Microsoft, Gates.]

13. Structural terms: docs + queries
  Call each of the resulting (8+, in the previous slide) subtrees a structural term
  Create one axis in the vector space for each distinct structural term
  Each document becomes a vector in the space of structural terms
  A query tree can likewise be factored into structural terms
    And represented as a vector
    Allows weighting portions of the query

14. Structural terms: weight
  Weights based on frequencies for number of occurrences (just as we had tf)
  All the usual issues with terms (stemming? case folding?) remain

15. Example of tf weighting
  Here the structural terms containing "to" or "be" would have more weight than those that don't
  [Figure: a tree whose leaves hold words from "To be or not to be".]

16. Down-weighting
  For the doc on the left: in a structural term rooted at the node Play, shouldn't Hamlet have a higher tf weight than Yorick?
  Idea: multiply the tf contribution of a term to a node k levels up by α^k, for some α < 1.
  [Figure: a Play document containing "Hamlet" and "Alas poor Yorick".]

17. Down-weighting example, α = 0.8
  For the doc on the previous slide, the tf of Hamlet is multiplied by 0.8
  Yorick is multiplied by 0.64 (= 0.8^2)
  in any structural term rooted at Play.

18. The number of structural terms
  Can be huge!
  Impractical to build a vector space index with so many dimensions
  Will examine pragmatic solutions to this shortly; for now, continue to believe

19. Restrict structural terms?
  Depending on the application, we may restrict the structural terms
  E.g., may never want to return a Title node, only Book or Play nodes
  So don't enumerate/index/retrieve/score structural terms rooted at some nodes
  Two solutions
    Query-time materialization of axes
    Restrict the kinds of subtrees to a manageable set

20. Query-time materialization
  Instead of enumerating all structural terms of all docs (and the query), enumerate only for the query
  Here we seek a doc with Hamlet in the title
  On finding the match we compute the cosine similarity score
  After all matches are found, rank by sorting
  [Figure: a query tree with "Hamlet" matched against the document containing "Hamlet" and "Alas poor Yorick".]

21. Restricting the subtrees
  Enumerating all structural terms (subtrees) is prohibitive, for indexing
    Most subtrees may never be used in processing any query
  Can we get away with indexing a restricted class of subtrees?
    Ideally focus on subtrees likely to arise in queries
  Only paths including a lexicon term (IBM Haifa)

22. Example of a retrieval step
  Match = [figure content lost in extraction]

23. XQuery

24. XQuery
  SQL for XML
  Usage scenarios
    Human-readable documents
    Data-oriented documents
    Mixed documents (e.g., patient records)
  Relies on
    XPath
    XML Schema datatypes

25. XQuery
  The principal forms of XQuery expressions are:
    path expressions
    element constructors
    FLWR ("flower") expressions
    list expressions
    conditional expressions
    quantified expressions
    datatype expressions
  Evaluated with respect to a context

26. FLWR
  FOR $p IN document("bib.xml")//publish
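
The structural-term slides (structural terms, tf weighting, down-weighting with a factor of 0.8, and cosine scoring for a query such as "Hamlet in the title") can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: structural terms are restricted to downward paths ending in a lexicon term (the restriction the slides attribute to IBM Haifa), the tuple-based tree encoding and the function names `structural_terms` and `cosine` are hypothetical, and the Play/Title/Act/Scene layout is assumed because the original figure did not survive.

```python
import math
from collections import Counter

ALPHA = 0.8  # per-level down-weighting factor, taken from the slide example


def structural_terms(node, alpha=ALPHA):
    """Map path-shaped structural terms to down-weighted term frequencies.

    Assumed encoding: a node is (tag, children); each child is either
    another node or a text string. A structural term is a tuple
    (tag, ..., tag, word): a downward path of tags ending in a word.
    """
    weights = Counter()

    def walk(current, ancestors):
        tag, children = current
        path = ancestors + [tag]
        for child in children:
            if isinstance(child, str):
                for word in child.lower().split():
                    # k = levels between the element holding the word
                    # and the node the structural term is rooted at.
                    for k in range(len(path)):
                        root_index = len(path) - 1 - k
                        term = tuple(path[root_index:]) + (word,)
                        weights[term] += alpha ** k
            else:
                walk(child, path)

    walk(node, [])
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Assumed layout of the Hamlet example document.
play = ("Play", [
    ("Title", ["Hamlet"]),
    ("Act", [("Scene", ["Alas poor Yorick"])]),
])
# A second, made-up document where Hamlet occurs only inside a scene.
other = ("Play", [
    ("Title", ["Macbeth"]),
    ("Act", [("Scene", ["Hamlet"])]),
])

doc_vec = structural_terms(play)
other_vec = structural_terms(other)

# Query: "a doc with Hamlet in the title", factored the same way.
query_vec = structural_terms(("Title", ["Hamlet"]))

score = cosine(query_vec, doc_vec)
other_score = cosine(query_vec, other_vec)
```

In this sketch the structural term rooted at Play that reaches Hamlet through Title gets weight 0.8, and the one that reaches Yorick through Act and Scene gets 0.8 squared, matching the down-weighting example. The query shares a structural term only with the first document, so the second document scores zero even though it contains the word Hamlet. General subtrees, as enumerated on the "Subtrees and structure" slide, would need a richer term representation than paths.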