Cluster-Based Pattern Recognition in Natural Language Text

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

by Shmuel Brody
under the supervision of Prof. Naftali Tishby
August 2005

Acknowledgements

I would like to thank my adviser Prof. Tishby for his guidance and assistance in producing this work, and for his suggestions and positive input. I would also like to thank Beata Beigman Klebanov for her constant help and advice throughout this work, including (but definitely not limited to) the contribution of the parsed data used here. Also deserving of thanks are my family, for their support, and especially my grandmother, Rose Brody, for her confidence in my achievements.

Abstract

This work presents the Clustered Clause structure, which uses information-based clustering and dependencies between sentence components to provide a simplified and generalized model of a grammatical clause. We show that this representation, which is based on dependencies within the sentence, enables us to detect complex textual relations at a higher level of context. The relations we detect are of interest in themselves, as linguistic phenomena, and are also highly suited for use in certain linguistic and cognitive tasks. We define and search for several types of patterns, moving from basic patterns to more complex ones, from patterns within the sentence to those involving entire sentences. Examples of recognized patterns of each type are presented, and also descriptions of several interesting phenomena detected by our method. We assess the quality of the results, and demonstrate the importance of the clustering and dependency model we chose. The principles behind our method are largely domain-independent, and can therefore be applied to other forms of structured sequential data as well.

Table of Contents

1 Introduction
  1.1 The Problem
  1.2 Overview
  1.3 Related Previous Work
    1.3.1 Syntax and Distributional Information as Measures of Semantics
    1.3.2 Relations from Patterns and Templates
    1.3.3 Feature Sets and Similarity Measures
    1.3.4 Uses of Similarity and Relatedness Measures
    1.3.5 Semantic Databases
    1.3.6 Relationships Involving a Higher Level of Context
    1.3.7 Patterns Containing Cluster Units
    1.3.8 Novel Aspects of this Work
  1.4 Importance and Motivation
    1.4.1 Cognition & World Knowledge Acquisition
    1.4.2 Automated Rule Acquisition
    1.4.3 Query Enhancement
    1.4.4 Implication & Entailment
    1.4.5 Anaphora Resolution
2 The Work Setup
  2.1 The Clause Model
    2.1.1 MINIPAR's Sentence Structure
    2.1.2 The Simplified Clause Structure
  2.2 The Clustering
    2.2.1 Clustering Methods
  2.3 The Information Bottleneck Concept
    2.3.1 The Sequential IB Clustering Method
    2.3.2 The Variables and Use of the Clause Model
  2.4 The Clustered-Clause Representation
  2.5 Pattern Definition
  2.6 Evaluation Method
3 The Procedure
  3.1 The Data
  3.2 Preprocessing
  3.3 Clustering
  3.4 Simple Pattern Detection
  3.5 Complex Pattern Detection
  3.6 Reducing Generalized Patterns to Specific Ones
  3.7 Significance Calculation
4 Results
  4.1 Clustering Results
    4.1.1 The Clusters
    4.1.2 Evaluating the Quality of the Clustering
    4.1.3 The Clustered Clauses
  4.2 Intra-Clause Patterns
  4.3 Inter-Clause Patterns
    4.3.1 Patterns within Three Clauses (t=3)
    4.3.2 Longer-Range Patterns (t=6, t=9)
  4.4 Complex Patterns
  4.5 The Influence of Clustering
5 Discussion
  5.1 Conclusions
  5.2 Other Areas of Application
  5.3 Possible Extensions and Improvements
    5.3.1 Re-inserting Removed Words
    5.3.2 Different Data Set
    5.3.3 Different Evaluation Method
    5.3.4 Richer Sentence Model
6 Appendix A - Clustering Results
  6.1 Subject Clusters
  6.2 Verb Clusters
  6.3 Object Clusters
7 Appendix B - Intra-Clause Patterns
  7.1 Subject-Verb Patterns
  7.2 Verb-Object Patterns
  7.3 Subject-Object Patterns
  7.4 Word-Word Patterns
    7.4.1 Language Phrase Patterns
    7.4.2 World Patterns
    7.4.3 Patterns Resulting from Parser Misclassification
    7.4.4 Patterns Specific to the Corpus
8 Appendix C - Inter-Clause Patterns
  8.1 Patterns within Three Clauses (t=3)
9 Appendix D - Complex Patterns
  9.1 Patterns with Subject-Subject Anchor
  9.2 Patterns with Verb-Verb Anchor
  9.3 Patterns with Object-Object Anchor
  9.4 Patterns with Subject-Object Anchor
  9.5 Patterns with Object-Subject Anchor
Bibliography

1 Introduction

1.1 The Problem

This work is concerned with the problem of detecting patterns in sequential data. When we deal with sequences where each point in the sequence can have one of a very large number of values, such patterns are often difficult to detect. The difficulty stems mainly from the problem of data sparseness, meaning that our sequence is not (and usually, cannot feasibly be) long enough to give a true representation of the value distribution. The large number of values presents another problem: since we are usually looking for patterns which should be applicable to a large part of the data, finding a pattern which applies to a small number of values is of little use to us. The procedure we present here is designed to solve both these problems, and facilitate the pattern detection task. It combines the use of clustering via mutual information with a model chosen to fit both our specific pattern detection task and the data we deal with.

An important example of such a situation is pattern detection in text. If we view the text as a sequence of sentences, it is easy to see that finding patterns between sentences is very difficult. We hardly ever encounter the same exact sentence more than once, so at first glance, no patterns exist between whole sentences. We are well aware, however, that patterns do exist, but on the semantic level, rather than the purely lexical one. Since semantic inf