Thesis code: TP181

Capital Normal University
Bachelor's Degree Thesis

Research on Web-Based Text Classification Mining

School: College of Information Engineering
Major: Computer Science and Technology (Teacher Education)
Class year: 2001
Student ID: 1011000035
Advisor: Liu Li-zhen
Author: Xu Ying
Completion date: June 6, 2005

Capital Normal University, page 1 of 33

Chinese Abstract

The Internet has become an enormous source of information. How to make that information serve people better, and how to retrieve what is needed quickly and accurately, is an important problem. Web-based information processing has therefore become an active research area, and text classification on the Web is one of the main research topics in Web data mining.

This thesis introduces the theory of data mining, Web mining, and text classification; analyzes the characteristics of Web data; compares HTML with traditional data; and reviews several text classification algorithms, focusing on the naive Bayes classification algorithm and the concrete process of improving it. It attempts to use HTML tag weights to compensate for the weakness of the conditional independence assumption in naive Bayes. It briefly describes existing techniques for filtering the tags in Web pages, and combines the useful information carried by the tags with a text classification algorithm to classify text. Finally, because the precision of the improved classifier is not ideal, the thesis summarizes the work that remains on this topic and offers some of the author's own views.

Keywords: Web mining; naive Bayes; data mining; text classification; Web page tags

Research on Text Classification Mining Based on the Web

ABSTRACT

The Internet has become a vast source of information. How to make Internet information serve people better, and how to obtain information quickly and accurately, are important issues that confront us. Research on Web-based information processing is now a hot spot, and text categorization on the Web has become one of the key topics in Web mining research. This thesis introduces the theory of data mining, Web mining, and text classification, analyzes the features of Web data, and compares HTML with traditional data. It analyzes several text categorization algorithms, with emphasis on the naive Bayes classifier and the concrete process of improving its algorithm. The thesis tries to use HTML tags to improve the naive Bayes classifier, whose weakness is its independence hypothesis. In putting the classifier into practice, the thesis summarizes methods for filtering HTML tags, then tries to use the information from the tags together with a text categorization algorithm to classify text. Finally, because the precision of the improved classifier is not ideal, the remaining work on this subject is summarized and the author's own views are presented.

Xu Ying
Directed by Liu Li-zhen

Keywords: Web Mining; Naïve Bayes; Data Mining; Text Categorization; HTML Tags

Contents

Chinese Abstract
Foreign-Language Abstract
Chapter 1 Introduction
    1.1 Background and Significance of the Topic
    1.2 Data Mining
    1.3 Web Mining
    1.4 Current State and Development of Web Mining Research
    1.5 Main Research Content and Organization of This Thesis
Chapter 2 Web-Based Text Classification Mining
    2.1 Introduction
    2.2 Preprocessing of Web Text
        2.2.1 Web Text Data Collection
        2.2.2 Text Word Segmentation
        2.2.3 Text Feature Library
    2.3 Text Classification
        2.3.1 Common Text Classification Methods
        2.3.2 Comparison of Text Classification Methods
        2.3.3 Characteristics of Web Text Classification
    2.4 Methods for Evaluating Classification Performance
    2.5 Chapter Summary
Chapter 3 Research on the Naive Bayes Classification Method
    3.1 Introduction to Naive Bayes Classification
    3.2 Statement of the Problem
    3.3 Concrete Solution
    3.4 Experimental Results