Hunan University Graduation Thesis
School of Software, Hunan University

Implementation of Text Object Extraction for Internet Web Pages

Author: Zhang Hui
Tutor: Lin Yaping

Abstract

The Internet contains a large amount of structured information about real-world objects. To cope with the severe challenge of the information explosion, and to extract and integrate the many kinds of text object information on web pages so that object-level search becomes possible, automated techniques are urgently needed to help people quickly find the information they actually need among massive volumes of data. Text object extraction for web pages is one way to address this problem.

Building on traditional information extraction theory and methods, and targeting the currently popular blog domain, this thesis proposes an algorithm for extracting the main text of blog posts based on HTML features and machine learning. The work studies the characteristics of blog pages, proposes a page segmentation algorithm based on HTML tag features, trains a decision tree on a blog data set, tests and evaluates the algorithm with the specialized statistical tool WEKA, and summarizes the algorithm's strengths and the areas that could be improved.

Finally, the thesis presents the system architecture and interface of Geeseek, a blog search engine built on this extraction algorithm. Geeseek is a new type of vertical search engine that can search blog home pages and blog article pages quickly and effectively. As far as the author knows, Geeseek is also the first blog search engine developed at a university in China.

Keywords: Internet, information explosion, information extraction, blog, HTML, machine learning, decision tree, search engine, Geeseek

Contents

1. Introduction
  1.1 Background and Purpose
  1.2 Research Status in China and Abroad
    1.2.1 Research Status in China
    1.2.2 Research Status Abroad
  1.3 Research Methods
  1.4 Thesis Structure and Research Content
2. Overview of Web Information Extraction and Web Page Text Object Extraction
  2.1 The Concept of Web Information Extraction
  2.2 Methods of Web Information Extraction
  2.3 A Typical Web Information Extraction Workflow
  2.4 Theory and Methods of Web Page Text Object Extraction
3. Design of the Blog Main-Text Extraction System
  3.1 Overview of Blog Search
  3.2 The Blog Main-Text Extraction Process
    3.2.1 Classification
    3.2.2 Segmentation
    3.2.3 Statistical Training to Obtain the Decision Tree
  3.3 Testing and Evaluation of the Algorithm
  3.4 Significance of and Reflections on the Blog Main-Text Extraction Algorithm
4. The Geeseek Search Engine Based on Blog Main-Text Extraction
  4.1 Introduction to the Geeseek System
  4.2 The Blog Main-Text Extraction Module
    4.2.1 Overview of the Blog Main-Text Extraction Module
    4.2.2 Main Data Classes of the Blog Main-Text Extraction Module
    4.2.3 Implementation Approach of the Blog Main-Text Extraction Module