Learning the crawler4j Source Code (1): A Crawler That Collects News Titles from Sohu News
crawler4j is an open-source web crawler written in Java. It provides a simple, easy-to-use API that lets you build a multithreaded web crawler in a few minutes. The example below, combined with jsoup, collects news title information from Sohu News (http://news.sohu.com/).
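If the project is built with Maven, the two libraries can be declared roughly as follows; the version numbers are only examples from around the time this post was written and should be replaced with whatever versions you actually use.

```xml
<!-- crawler4j: the crawling framework used in this example -->
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.2</version> <!-- example version, adjust as needed -->
</dependency>
<!-- jsoup: optional HTML parsing helper mentioned in this post -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version> <!-- example version, adjust as needed -->
</dependency>
```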
The whole process takes only two steps:
Step 1: build the core of the crawler (the crawler class)
```java
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

/**
 * @date 2016-08-20 11:52
 * @version
 * @since JDK 1.8
 */
public class MyCrawler extends WebCrawler {

    // Filter out URLs pointing to static resources (stylesheets, scripts, images, media, archives)
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Only follow links under news.sohu.com that are not static resources
        return !FILTERS.matcher(href).matches() && href.startsWith("http://news.sohu.com/");
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        logger.info("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            logger.debug("Text length: " + text.length());
            logger.debug("Html length: " + html.length());
            logger.debug("Number of outgoing links: " + links.size());
            logger.info("Title: " + htmlParseData.getTitle());
        }
    }
}
```
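The visit() method above relies on crawler4j's built-in HtmlParseData to obtain the page title. Since this post mentions jsoup, here is a minimal sketch of how the raw HTML could instead be parsed with jsoup. TitleExtractor is a hypothetical helper that is not part of the original code, and the "h1" selector is only a guess that would need to be adapted to Sohu's actual page markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of the original post.
public class TitleExtractor {

    /**
     * Parses raw HTML with jsoup and prints the page <title> plus any <h1> headlines.
     * Could be called from MyCrawler.visit() as:
     *   TitleExtractor.printTitles(htmlParseData.getHtml());
     */
    public static void printTitles(String html) {
        Document doc = Jsoup.parse(html);
        System.out.println("Page title: " + doc.title());

        // "h1" is a placeholder selector; inspect the target pages and adjust it.
        for (Element headline : doc.select("h1")) {
            System.out.println("Headline: " + headline.text());
        }
    }
}
```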
Step 2: build the crawl controller
```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

/**
 * @date 2016-08-20 11:55
 * @version
 * @since JDK 1.8
 */
public class MyController {

    /**
     * @param args
     * @since JDK 1.8
     */
    public static void main(String[] args) {
        // Local embedded storage folder (crawler4j uses Berkeley DB under the hood)
        String crawlStorageFolder = "data/crawl/root";
        int numberOfCrawlers = 3;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller;
        try {
            controller = new CrawlController(config, pageFetcher, robotstxtServer);
            // Seed URL: crawling starts from the Sohu News front page
            controller.addSeed("http://news.sohu.com/");
            // Start the crawl with the given crawler class and number of threads
            controller.start(MyCrawler.class, numberOfCrawlers);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
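The controller above only sets the storage folder and otherwise keeps crawler4j's defaults. If the crawl needs to be throttled or bounded, CrawlConfig exposes a few commonly used settings; the sketch below shows how they could be applied, with CrawlConfigExample being a hypothetical class and the concrete values being illustrative choices rather than anything from the original post.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

// Hypothetical example class, not part of the original post.
public class CrawlConfigExample {

    // Builds a CrawlConfig with a few optional limits; the values are examples only.
    static CrawlConfig buildConfig() {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("data/crawl/root");
        config.setPolitenessDelay(1000);     // wait 1000 ms between requests to the same host
        config.setMaxDepthOfCrawling(3);     // follow links at most 3 hops away from the seed
        config.setMaxPagesToFetch(1000);     // stop after fetching 1000 pages in total
        config.setResumableCrawling(false);  // start fresh instead of resuming an interrupted crawl
        return config;
    }
}
```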
Crawl results: