Abstract

A web crawler (Web Crawler), usually just called a crawler, is an important component of a search engine. With the rapid progress of information technology, the web crawler has remained an active research topic, and its quality largely determines the quality of the search engine built on it. Current crawler research falls into two directions: search strategies for web pages, and algorithms for analyzing URLs. Among these, the topic-focused web crawler is the current trend: it applies web page analysis to filter out links unrelated to the topic and adds the qualifying URLs to a queue of pages waiting to be crawled.

If the internet is pictured as a spider web, then the Spider is the spider crawling across it. A web spider finds pages through their link addresses. Starting from one page of a site (usually the home page), it reads the content of the page, finds the other link addresses in it, and then follows those links to the next pages, repeating this cycle until every page of the site has been crawled. If the whole internet is treated as a single site, a web crawler can in principle use the same method to fetch every page on the internet.

Keywords: Web crawler; Linux Socket; C/C++; Multithreading; Mutex

Contents

Abstract
Chapter 1 Overview
  1.1 Project Background
  1.2 History and Classification of Web Crawlers
    1.2.1 History of Web Crawlers
    1.2.2 Classification of Web Crawlers
  1.3 Development Trends of Web Crawlers
  1.4 Necessity of Developing the System
  1.5 Organization of This Thesis
Chapter 2 Survey of Related Technologies and Tools
  2.1 Definition of a Web Crawler
  2.2 Introduction to Web Page Search Strategies
    2.2.1 Breadth-First Search Strategy
  2.3 Introduction to Related Tools
    2.3.1 Operating System
    2.3.2 Software Configuration
Chapter 3 Analysis and Outline Design of the Web Crawler Model
  3.1 Model Analysis of the Web Crawler
  3.2 Search Strategy of the Web Crawler
  3.3 Outline Design of the Web Crawler
Chapter 4 Design and Implementation of the Web Crawler Model
  4.1 Overall Design of the Web Crawler
  4.2 Detailed Design of the Web Crawler
    4.2.1 URL Class Design and URL Normalization
    4.2.2 Crawling Web Pages
    4.2.3 Web Page Analysis
    4.2.4 Web Page Storage
    4.2.5 Linux Socket Communication
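The crawl cycle described in the abstract (start from one page, read it, collect its links, follow them until no new pages remain) can be sketched in C++ as a breadth-first traversal. This is only a minimal illustration under stated assumptions, not the thesis's implementation: the names crawl, LinkExtractor, demo_links and max_pages are hypothetical, and the network is replaced by a small in-memory site so the sketch needs no sockets or HTML parsing.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical callback type: given a URL, return the links found on that page.
// A real crawler would download the page over a socket and parse its HTML here.
using LinkExtractor = std::function<std::vector<std::string>(const std::string&)>;

// Visit pages reachable from `seed` in breadth-first order, stopping after
// `max_pages` pages. Returns the URLs in the order they were visited.
std::vector<std::string> crawl(const std::string& seed,
                               const LinkExtractor& extract_links,
                               std::size_t max_pages) {
    std::vector<std::string> visited_order;
    std::set<std::string> seen{seed};   // URLs already queued or visited
    std::queue<std::string> waiting;    // URLs waiting to be fetched
    waiting.push(seed);

    while (!waiting.empty() && visited_order.size() < max_pages) {
        std::string url = waiting.front();
        waiting.pop();
        visited_order.push_back(url);

        for (const std::string& link : extract_links(url)) {
            // insert() reports whether the URL is new, so each URL is queued once.
            if (seen.insert(link).second) {
                waiting.push(link);
            }
        }
    }
    return visited_order;
}

// Stand-in for the network (an assumption for illustration):
// a tiny in-memory "site" mapping each page to the links it contains.
std::vector<std::string> demo_links(const std::string& url) {
    static const std::map<std::string, std::vector<std::string>> site = {
        {"/index", {"/a", "/b"}},
        {"/a", {"/b", "/c"}},
        {"/b", {"/index"}},
        {"/c", {}},
    };
    auto it = site.find(url);
    if (it == site.end()) {
        return {};
    }
    return it->second;
}
```

Calling crawl("/index", demo_links, 10) on this demo site visits /index, /a, /b and /c in that order. The seen set is what keeps the loop from revisiting /index through the back-link on /b, and the max_pages cap is an added safeguard, since a crawl of a real site may never run out of new links.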
Document title: Web Crawler Thesis
Link: https://www.777doc.com/doc-2072148.html