Text-Mining-in-R

Technical Report 2012

Text Mining in the R Environment (Text Mining in R)

Version 0.02, 2012-03-21

刘思喆 (Liu Sizhe)
Homepage | Updates | Contact: sunbjt@gmail.com | Sina Weibo: @刘思喆

Copyright © 2012 R and all the Contributors to Rtm. All rights reserved.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being Contributors, no Front-Cover Texts, and no Back-Cover Texts.

Contents

1 Introduction to Text Mining
2 Overview of Natural Language Processing Techniques
  2.1 Related R Packages
  2.2 Stemming and Tokenization
  2.3 Chinese Word Segmentation
3 The tm Package
  3.1 Introduction
  3.2 Reading Data
  3.3 Writing Data
  3.4 Extracting from a Corpus
  3.5 Transformations
  3.6 Converting to Plain Text
    3.6.1 Removing Extra Whitespace
    3.6.2 Converting to Lowercase
    3.6.3 Removing Stop Words
    3.6.4 Filling (填充)
  3.7 Filters
  3.8 Metadata Management
  3.9 Standard Operators and Functions
  3.10 Creating a Term-Document Matrix
  3.11 Operations on Term-Document Matrices
  3.12 Dictionaries
4 The XML Package: A Powerful Tool for Web Page Parsing
  4.1 Web Page Parsing
  4.2 Character Encoding Conversion
5 Using XML Together with tm (todo)
6 Some Text Mining Applications
  6.1 Basic Analysis Techniques
    6.1.1 Text Clustering
    6.1.2 Text Classification
  6.2 Latent Semantic Analysis (not done)
  6.3 Topic Models
A Appendix
  A.1 About XML Files
  A.2 About Regular Expressions
1 Introduction to Text Mining

Text mining has been described as "the process of handling text automatically or semi-automatically". It covers document clustering, document classification, natural language processing, stylometric analysis, and web mining.

Text processing starts from a corpus to be analysed (text corpus), such as reports, letters, or publications. A semi-structured text database is then built from this corpus, and from that a structured term-document matrix containing term frequencies is generated.

Figure 1: The text mining workflow

This general data structure is then used in subsequent analyses, for example:

• Text classification, e.g. assigning unseen texts to categories based on how existing texts are classified;
• Syntactic analysis;
• Information extraction and repair;
• Document summarization, e.g. extracting representative keywords and sentences.

2 Overview of Natural Language Processing Techniques

2.1 Related R Packages

Phonetics and Speech Processing: emu is a collection of tools for the creation, manipulation, and analysis of speech databases. At the core of EMU is a database search engine which allows the researcher to find various speech segments based on the sequential and hierarchical structure of the utterances in which they occur. EMU includes an interactive labeller which can display spectrograms and other speech waveforms, and which allows the creation of hierarchical, as well as sequential, labels for a speech utterance.

Lexical Databases: wordnet provides an R interface to WordNet, a large lexical database of English.

Keyword Extraction and General String Manipulation:

• R's base package already provides a rich set of character manipulation routines.
• RKEA provides an R interface to KEA (Version 5.0). KEA (for Keyphrase Extraction Algorithm) allows for extracting keyphrases from text documents. It can be used either for free indexing or for indexing with a controlled vocabulary.
• gsubfn can be used for certain parsing tasks such as extracting words from strings by content rather than by delimiters. demo("gsubfn-gries") shows an example of this in a natural language processing context.
• tau contains basic string manipulation and analysis routines needed in text processing, such as dealing with character encoding, language, pattern counting, and tokenization.

Natural Language Processing:

• openNLP provides an R interface to OpenNLP, a collection of natural language processing tools including a sentence detector, tokenizer, pos-tagger, shallow and full syntactic parser, and named-entity detector, using the Maxent Java package for training and using maximum entropy models.
• openNLPmodels.en ships trained models for English and openNLPmodels.es for Spanish, to be used with openNLP.
• RWeka is an interface to Weka, which is a collection of machine learning algorithms for data mining tasks written in Java.
  Especially useful in the context of natural language processing is its functionality for tokenization and stemming.
• Snowball provides the Snowball stemmers, which contain the Porter stemmer and several other stemmers for different languages. See the Snowball webpage for details.
• Rstem is an alternative interface to a C version of Porter's word stemming algorithm.
• KoNLP provides a collection of conversion routines (e.g. Hangul to Jamos), stemming, and part-of-speech tagging through interfacing with Lucene's HanNanum analyzer.

Text Mining:

• tm provides a comprehensive text mining framework for R. The Journal of Statistical Software article "Text Mining Infrastructure in R" gives a detailed overview and presents techniques for count-based analysis methods, text clustering, text classification and string kernels.
• lsa provides routines for performing a latent semantic analysis with R. The basic idea of latent semantic analysis (LSA) is that texts do have a higher-order (= latent semantic) structure which, however, is obscured by word usage (e.g. through the use of synonyms or polysemy). By usin
