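Liu Jingfa, Li Fan, Jiang Shengyi. Focused Annealing Crawler Algorithm for Rainstorm Disasters Based on Comprehensive Priority and Host Information[J]. Computer Science, 2019, 46(2): 215-222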
Focused Annealing Crawler Algorithm for Rainstorm Disasters Based on Comprehensive Priority and Host Information
Received: 2018-07-12  Revised: 2018-09-13
DOI:
Keywords: Rainstorm disasters, Web focused crawler, Comprehensive priority, Host information, Simulated annealing algorithm
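Funding: This work was supported by the Major Program of the National Social Science Foundation of China (16ZDA047), the National Natural Science Foundation of China (61373016), and the Natural Science Foundation of Jiangsu Province (BK20181409, BK20171458).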
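Author / Affiliation / E-mail
Liu Jingfa: School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044; School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006; jfliu@nuist.edu.cn
Li Fan: School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044
Jiang Shengyi: School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006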
Abstract:
      The Internet now aggregates a wide variety of information related to rainstorm disasters, but searching for it manually is inefficient, which makes focused web crawlers important. Building on a general-purpose web crawler, and aiming to improve the accuracy of topic-relevance computation and to prevent topic drift, this paper proposes a comprehensive priority evaluation method for hyperlinks that combines webpage content with link structure. The priority of a link is computed from four components: the topic relevance of its anchor text, the topic relevance of the webpages that contain the link, the PageRank (PR) value of the webpage the link points to, and the topic relevance of that webpage. In addition, because the search easily becomes trapped in local optima, the paper designs, for the first time, a focused crawler algorithm that combines the crawler's memory of historical host information with simulated annealing (SA). Crawling experiments on the topic of rainstorm disasters show that, for the same number of crawled pages, the proposed algorithm retrieves more topic-relevant webpages than the breadth-first search (BFS) and optimal priority search (OPS) strategies, and its crawling accuracy is markedly higher.
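The abstract does not give the scoring formula or the annealing schedule, so the following is only a minimal Python sketch of the two ideas under stated assumptions, not the authors' implementation. It assumes the four component scores are already normalized to [0, 1], combines them with a weighted sum whose equal weights are hypothetical, and uses the standard Metropolis acceptance rule of simulated annealing. The function names `comprehensive_priority` and `accept_link` are illustrative, and the host-information memory is left out because the abstract does not specify it.

```python
import math
import random


def comprehensive_priority(anchor_rel, parent_rel, pagerank, target_rel,
                           weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine four link scores into one priority.

    All inputs are assumed to lie in [0, 1]. The weighted sum and the
    equal default weights are assumptions made for illustration only.
    """
    scores = (anchor_rel, parent_rel, pagerank, target_rel)
    return sum(w * s for w, s in zip(weights, scores))


def accept_link(candidate_priority, best_priority, temperature, rng=random):
    """Standard Metropolis acceptance rule.

    A candidate that is at least as good as the current best is always
    accepted. A worse candidate is accepted with probability
    exp(-delta / temperature), which lets the crawler occasionally follow
    lower-priority links and leave a local optimum.
    """
    delta = best_priority - candidate_priority
    if delta <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta / temperature)
```

In a crawl loop built on this sketch, the temperature would be lowered gradually, so that low-priority links are explored often at the start and rarely near the end.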