Cluster-Based Pattern Recognition in Natural Language Text

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

by Shmuel Brody

under the supervision of Prof. Naftali Tishby

August 2005

Acknowledgements

I would like to thank my adviser Prof. Tishby for his guidance and assistance in producing this work, and for his suggestions and positive input. I would also like to thank Beata Beigman Klebanov for her constant help and advice throughout this work, including (but definitely not limited to) the contribution of the parsed data used here. Also deserving of thanks are my family, for their support, and especially my grandmother, Rose Brody, for her confidence in my achievements.

Abstract

This work presents the Clustered Clause structure, which uses information-based clustering and dependencies between sentence components to provide a simplified and generalized model of a grammatical clause. We show that this representation, which is based on dependencies within the sentence, enables us to detect complex textual relations at a higher level of context. The relations we detect are of interest in themselves, as linguistic phenomena, and are also highly suited for use in certain linguistic and cognitive tasks. We define and search for several types of patterns, moving from basic patterns to more complex ones, from patterns within the sentence to those involving entire sentences. Examples of recognized patterns of each type are presented, and also descriptions of several interesting phenomena detected by our method. We assess the quality of the results, and demonstrate the importance of the clustering and dependency model we chose. The principles behind our method are largely domain-independent, and can therefore be applied to other forms of structured sequential data as well.

Table of Contents

1 Introduction
1.1 The Problem
1.2 Overview
1.3 Related Previous Work
1.3.1 Syntax and Distributional Information as Measures of Semantics
1.3.2 Relations from Patterns and Templates
1.3.3 Feature Sets and Similarity Measures
1.3.4 Uses of Similarity and Relatedness Measures
1.3.5 Semantic Databases
1.3.6 Relationships Involving a Higher Level of Context
1.3.7 Patterns Containing Cluster Units
1.3.8 Novel Aspects of this Work
1.4 Importance and Motivation
1.4.1 Cognition & World Knowledge Acquisition
1.4.2 Automated Rule Acquisition
1.4.3 Query Enhancement
1.4.4 Implication & Entailment
1.4.5 Anaphora Resolution
2 The Work Setup
2.1 The Clause Model
2.1.1 MINIPAR's Sentence Structure
2.1.2 The Simplified Clause Structure
2.2 The Clustering
2.2.1 Clustering Methods
2.3 The Information Bottleneck Concept
2.3.1 The Sequential IB Clustering Method
2.3.2 The Variables and Use of the Clause Model
2.4 The Clustered-Clause Representation
2.5 Pattern Definition
2.6 Evaluation Method
3 The Procedure
3.1 The Data
3.2 Preprocessing
3.3 Clustering
3.4 Simple Pattern Detection
3.5 Complex Pattern Detection
3.6 Reducing Generalized Patterns to Specific Ones
3.7 Significance Calculation
4 Results
4.1 Clustering Results
4.1.1 The Clusters
4.1.2 Evaluating the Quality of the Clustering
4.1.3 The Clustered Clauses
4.2 Intra-Clause Patterns
4.3 Inter-Clause Patterns
4.3.1 Patterns within Three Clauses (t=3)
4.3.2 Longer-Range Patterns (t=6, t=9)
4.4 Complex Patterns
4.5 The Influence of Clustering
5 Discussion
5.1 Conclusions
5.2 Other Areas of Application
5.3 Possible Extensions and Improvements
5.3.1 Re-inserting Removed Words
5.3.2 Different Data Set
5.3.3 Different Evaluation Method
5.3.4 Richer Sentence Model
6 Appendix A - Clustering Results
6.1 Subject Clusters
6.2 Verb Clusters
6.3 Object Clusters
7 Appendix B - Intra-Clause Patterns
7.1 Subject-Verb Patterns
7.2 Verb-Object Patterns
7.3 Subject-Object Patterns
7.4 Word-Word Patterns
7.4.1 Language Phrase Patterns
7.4.2 World Patterns
7.4.3 Patterns Resulting from Parser Misclassification
7.4.4 Patterns Specific to the Corpus
8 Appendix C - Inter-Clause Patterns
8.1 Patterns within Three Clauses (t=3)
9 Appendix D - Complex Patterns
9.1 Patterns with Subject-Subject Anchor
9.2 Patterns with Verb-Verb Anchor
9.3 Patterns with Object-Object Anchor
9.4 Patterns with Subject-Object Anchor
9.5 Patterns with Object-Subject Anchor
Bibliography

1 Introduction

1.1 The Problem

This work is concerned with the problem of detecting patterns in sequential data. When we deal with sequences where each point in the sequence can have one of a very large number of values, such patterns are often difficult to detect. The difficulty stems mainly from the problem of data sparseness, meaning that our sequence is not (and usually, cannot feasibly be) long enough to give a true representation of the value distribution. The large number of values presents another problem: since we are usually looking for patterns which should be applicable to a large part of the data, finding a pattern which applies to a small number of values is of little use to us. The procedure we present here is designed to solve both these problems, and facilitate the pattern detection task. It combines the use of clustering via mutual information with a model chosen to fit both our specific pattern detection task and the data we deal with.

An important example of such a situation is pattern detection in text. If we view the text as a sequence of sentences, it is easy to see that finding patterns between sentences is very difficult. We hardly ever encounter the same exact sentence more than once, so at first glance, no patterns exist between whole sentences. We are well aware, however, that patterns do exist, but on the semantic level, rather than the purely lexical one. Since semantic inf
