Text Mining in R

Technical Report 2012

Text Mining in R
Version 0.02, 2012-03-21

刘思喆
Homepage | Updates | Contact: sunbjt@gmail.com | Sina Weibo: @刘思喆

Copyright © 2012 R and all the Contributors to Rtm. All rights reserved.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being Contributors, no Front-Cover Texts, and no Back-Cover Texts.

Contents

1 Introduction to text mining
2 Overview of natural language processing techniques
  2.1 Related R packages
  2.2 Stemming and tokenization
  2.3 Chinese word segmentation
3 The tm package
  3.1 Introduction
  3.2 Data import
  3.3 Data export
  3.4 Extracting from a corpus
  3.5 Transformations
  3.6 Converting to plain text
    3.6.1 Removing extra whitespace
    3.6.2 Conversion to lower case
    3.6.3 Stop word removal
    3.6.4 Filling
  3.7 Filters
  3.8 Metadata management
  3.9 Standard operators and functions
  3.10 Creating term-document matrices
  3.11 Operations on term-document matrices
  3.12 Dictionary
4 The XML package: a powerful tool for web page parsing
  4.1 Web page parsing
  4.2 Character set conversion
5 Using XML together with the tm package (todo)
6 Some applications of text mining
  6.1 Basic analysis techniques
    6.1.1 Text clustering
    6.1.2 Text classification
  6.2 Latent semantic analysis (not done)
  6.3 Topic models
A Appendix
  A.1 About XML files
  A.2 About regular expressions

1 Introduction to text mining

Text mining has been described as "the automated or semi-automated processing of text". It covers document clustering, document classification, natural language processing, stylometric analysis (the study of variation in writing style), and web mining.

Text processing starts from a body of material to analyze (the text corpus), such as reports, letters, or publications. From this corpus a semi-structured text database is built, and from that database a structured term-document matrix containing term frequencies is generated.

Figure 1: The text mining workflow

This general-purpose data structure is then used for subsequent analyses, such as:
• text classification, for example assigning unseen texts to categories based on texts that are already classified;
• syntactic analysis;
• information extraction and repair;
• document summarization, for example extracting representative keywords or sentences.

2 Overview of natural language processing techniques

2.1 Related R packages

Phonetics and Speech Processing: emu is a collection of tools for the creation, manipulation, and analysis of speech databases. At the core of EMU is a database search engine which allows the researcher to find various speech segments based on the sequential and hierarchical structure of the utterances in which they occur. EMU includes an interactive labeller which can display spectrograms and other speech waveforms, and which allows the creation of hierarchical, as well as sequential, labels for a speech utterance.

Lexical Databases: wordnet provides an R interface to WordNet, a large lexical database of English.

Keyword Extraction and General String Manipulation:
• R's base package already provides a rich set of character manipulation routines.
• RKEA provides an R interface to KEA (Version 5.0). KEA (for Keyphrase Extraction Algorithm) allows for extracting keyphrases from text documents. It can be used either for free indexing or for indexing with a controlled vocabulary.
• gsubfn can be used for certain parsing tasks such as extracting words from strings by content rather than by delimiters. demo("gsubfn-gries") shows an example of this in a natural language processing context.
• tau contains basic string manipulation and analysis routines needed in text processing, such as dealing with character encoding, language, pattern counting, and tokenization.

Natural Language Processing:
• openNLP provides an R interface to OpenNLP, a collection of natural language processing tools including a sentence detector, tokenizer, pos-tagger, shallow and full syntactic parser, and named-entity detector, using the Maxent Java package for training and using maximum entropy models.
• openNLPmodels.en ships trained models for English and openNLPmodels.es for Spanish, to be used with openNLP.
• RWeka is an interface to Weka, which is a collection of machine learning algorithms for data mining tasks written in Java.

  Especially useful in the context of natural language processing is its functionality for tokenization and stemming.
• Snowball provides the Snowball stemmers, which contain the Porter stemmer and several other stemmers for different languages. See the Snowball webpage for details.
• Rstem is an alternative interface to a C version of Porter's word stemming algorithm.
• KoNLP provides a collection of conversion routines (e.g. Hangul to Jamos), stemming, and part of speech tagging through interfacing with the Lucene's HanNanum analyzer.

Text Mining:
• tm provides a comprehensive text mining framework for R. The Journal of Statistical Software article "Text Mining Infrastructure in R" gives a detailed overview and presents techniques for count-based analysis methods, text clustering, text classification and string kernels.
• lsa provides routines for performing a latent semantic analysis with R. The basic idea of latent semantic analysis (LSA) is that texts have a higher-order (= latent semantic) structure which, however, is obscured by word usage (e.g. through the use of synonyms or polysemy). By using