Beijing Union University Graduation Project

Abstract

With the rapid growth of information on the Web, Web-based services have flourished. As the foundation and an important component of these services, Web information collection is widely used in search engines, site structure analysis, page validity analysis, user interest mining, personalized information retrieval, and many other applications and research areas. However, as users demand more of these information services, traditional collection over the entire Web increasingly falls short: it cannot gather enough Web information in a timely manner, nor can it meet users' growing demand for personalization. This project therefore targets the vast body of educational resources on the Internet and studies how to effectively collect Web information that matches a specific topic.

Research on topic-based collection of Web educational resources covers three main areas: ontology construction, topic-focused crawlers, and Web page text classification. Web page text classification is widely used in search engines. This thesis studies text classification: it introduces the basic classification process; describes text preprocessing, word segmentation, and feature extraction; discusses the principles and methods of common classifiers such as naive Bayes, k-nearest neighbor, support vector machines, and voting; and explores Web page text classification techniques. Using support vector machines, an open, topic-based Web page text classification system was designed and implemented. Experiments show that it trains efficiently while also achieving high classification accuracy and recall.

Keywords: topic, word segmentation, vector space model, text classification, support vector machine (SVM)

Contents

Abstract (Chinese)
Abstract (English)
Contents
1 Introduction
    1.1 Research purpose and significance
    1.2 Research status in China and abroad
2 Topic-based Web information collection
    2.1 Basic principles
3 Web page text classification techniques
    3.1 Building a text classification system
        3.1.1 Automatic word segmentation
        3.1.2 Feature selection
        3.1.3 Vector space model
        3.1.4 TF*IDF heuristic weighting algorithm
    3.2 Text classification methods
        3.2.1 k-nearest neighbor algorithm (KNN)
        3.2.2 Bayesian algorithm (Naive Bayes)
        3.2.3 Decision tree classification
        3.2.4 Voting-based methods
        3.2.5 Support vector machine (SVM) method
            3.2.5.1 Principles of support vector machines
            3.2.5.2 Characteristics of support vector machines
4 System design and analysis
    4.1 Design of the topic-based Web text classification system
    4.2 The six major steps of the system
        4.2.1 Flowchart
        4.2.2 Corpus preparation
Title: Topic-Based Web Page Text Classification Technology
Link: https://www.777doc.com/doc-3127843 .html