Web Crawler
-
I. Technologies Used
This crawler is a small example I put together about half a month ago while learning crawling techniques. It is fairly simple, and since I am afraid I will forget it over time, here is a brief summary. The main external JARs are HttpClient 4.3.4 and HtmlParser 2.1, the IDE is IntelliJ IDEA 13.1, and dependencies are managed with Maven; if you are not comfortable with IntelliJ, you can also create the project in Eclipse.
II. Crawler Basics
1. What is a web crawler? (the basic principle)
Taking the term apart: the "web" is the internet, which is structured much like a spider's web, and the "crawler" is the spider that crawls all over it, collecting data and then processing it.
The encyclopedia definition: a web crawler (also called a web spider or web robot, and in the FOAF community more often a "web chaser") is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
Basic principle: a traditional crawler starts from the URLs of one or more seed pages and extracts the URLs found on them; while fetching pages, it keeps pulling new URLs from the current page into a queue until some stopping condition of the system is met, as shown in the flow chart. A focused crawler works in a more involved way: it uses a page-analysis algorithm to filter out links unrelated to the topic, keeps the useful links, and puts them into the queue of URLs waiting to be crawled. It then picks the next URL to fetch from the queue according to some search strategy and repeats the process until a stopping condition is reached.
2. What are the common crawl strategies?
Crawl strategies generally fall into three categories: depth-first, breadth-first, and best-first. Depth-first crawling often causes the crawler to become trapped, so breadth-first and best-first are the approaches commonly used today.
2.1 Breadth-First
Breadth-first traversal is a strategy for traversing a connected graph. It starts from a vertex V0 and fans out, visiting the broad region around it first, hence the name.
Its basic idea (a minimal traversal sketch follows the steps):
1) Start from some vertex V0 in the graph and visit it;
2) From V0, visit each of its not-yet-visited adjacent vertices W1, W2, ..., Wk; then, from W1, W2, ..., Wk in turn, visit their not-yet-visited adjacent vertices;
3) Repeat step 2 until every vertex has been visited, as shown in the figure below.
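To make the steps concrete, here is a minimal, self-contained breadth-first traversal sketch. It is illustration only and not part of the crawler code below; the adjacency map (vertex to list of neighbors) is a hypothetical representation chosen just for the example:

import java.util.*;

public class BfsDemo {
    // Breadth-first traversal of a graph given as an adjacency map (vertex -> neighbors)
    public static List<String> bfs(Map<String, List<String>> graph, String start) {
        List<String> order = new ArrayList<String>();
        Set<String> visited = new HashSet<String>();
        Deque<String> queue = new ArrayDeque<String>();
        queue.add(start);
        visited.add(start);
        while (!queue.isEmpty()) {
            String node = queue.poll();   // take the vertex at the head of the queue
            order.add(node);
            List<String> neighbors = graph.get(node);
            if (neighbors == null) {
                continue;
            }
            for (String next : neighbors) {
                if (visited.add(next)) {  // enqueue each neighbor that has not been visited yet
                    queue.add(next);
                }
            }
        }
        return order;                     // vertices in the order they were visited, level by level
    }
}

The queue guarantees that all vertices at distance n from the start are visited before any vertex at distance n+1, which is exactly the "level by level" behavior described above.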
2.2 Depth-First
Assume that initially no vertex in the graph has been visited. Depth-first search then proceeds as follows:
1) Pick some vertex Vi in the graph as the starting point, visit it and mark it;
2) With Vi as the current vertex, examine Vi's adjacent vertices Vj one by one; if Vj has not been visited, visit and mark it; if Vj has already been visited, move on to Vi's next adjacent vertex;
3) With Vj as the current vertex, repeat step 2 until every vertex reachable from Vi has been visited;
4) If unvisited vertices remain (the graph is not connected), pick any unvisited vertex as a new starting point and repeat the process until all vertices have been visited. The figures below use a directed graph and an undirected graph as examples.
The difference between breadth-first and depth-first:
Breadth-first traversal proceeds level by level: all nodes on one level are searched before moving down to the next level. Depth-first traversal instead follows a single branch until all nodes on it have been searched, and only then turns to the nodes on another branch.
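For comparison with the breadth-first sketch above, here is a minimal depth-first traversal sketch using an explicit stack. Again, it is illustration only; the adjacency-map representation is the same hypothetical stand-in as before:

import java.util.*;

public class DfsDemo {
    // Depth-first traversal of a graph given as an adjacency map (vertex -> neighbors)
    public static List<String> dfs(Map<String, List<String>> graph, String start) {
        List<String> order = new ArrayList<String>();
        Set<String> visited = new HashSet<String>();
        Deque<String> stack = new ArrayDeque<String>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (!visited.add(node)) {
                continue;                 // already reached through another branch
            }
            order.add(node);
            List<String> neighbors = graph.get(node);
            if (neighbors == null) {
                continue;
            }
            // push neighbors in reverse so the first neighbor is explored first,
            // driving the search down one branch before backtracking
            for (int i = neighbors.size() - 1; i >= 0; i--) {
                if (!visited.contains(neighbors.get(i))) {
                    stack.push(neighbors.get(i));
                }
            }
        }
        return order;
    }
}

Swapping the queue for a stack is the whole difference: the most recently discovered vertex is expanded next, so one branch is exhausted before the search backtracks.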
2.3 Best-First Search
A best-first strategy uses a page-analysis algorithm to predict how similar a candidate URL is to the target page, or how relevant it is to the topic, and fetches only the one or few URLs with the best scores. It visits only the pages that the analysis algorithm predicts to be "useful". This kind of search is well suited to crawling deep-web data, since it fetches only content that matches the requirements.
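A minimal sketch of the idea, using a priority queue ordered by a relevance score. The score() heuristic below is purely a placeholder for a real page-analysis or topic-relevance model, and the URLs are made up:

import java.util.*;

public class BestFirstDemo {
    // Stand-in relevance score; a real crawler would use a page-analysis model here
    static double score(String url) {
        return url.contains("news") ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // Frontier ordered so that the highest-scoring URL is fetched first
        PriorityQueue<String> frontier = new PriorityQueue<String>(16, new Comparator<String>() {
            public int compare(String a, String b) {
                return Double.compare(score(b), score(a));
            }
        });
        frontier.add("http://example.com/news/1.html");
        frontier.add("http://example.com/about.html");
        while (!frontier.isEmpty()) {
            String next = frontier.poll();   // most promising URL first
            System.out.println("fetch: " + next);
            // ...download the page, score the newly extracted URLs, add them to the frontier...
        }
    }
}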
3. Structure of the pages crawled in this article
The example in this article crawls news pages. On a news site, the important and most recent items are usually placed on the front page, and the importance of pages that sit deeper in the link structure generally decreases level by level, so a breadth-first strategy is the better fit. The figure below shows the structure of the pages this article will crawl:
III. A Breadth-First Crawler Example
1. Requirement: crawl Fudan University news (at most 100 pages)
Only 100 pages are crawled here, and every URL must start with http://news.fudan.edu.cn.
2. Implementation
Pull in the external JARs with Maven:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3.4</version>
</dependency>
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>2.1</version>
</dependency>
Main entry point of the program (MyCrawler.java):
package com.amos.crawl;

import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class MyCrawler {
    /**
     * Initialize the URL queue with the seed URLs
     *
     * @param seeds
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++) {
            LinkQueue.addUnvisitedUrl(seeds[i]);
        }
    }

    public void crawling(String[] seeds) {
        // Define a filter that keeps only links starting with http://news.fudan.edu.cn
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                if (url.startsWith("http://news.fudan.edu.cn")) {
                    return true;
                }
                return false;
            }
        };
        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);

        int count = 0;
        // Loop while there are still unvisited links and at most 100 pages have been crawled
        while (!LinkQueue.isUnvisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 100) {

            System.out.println("count:" + (++count));

            // Dequeue the URL at the head of the queue
            String visitURL = (String) LinkQueue.unVisitedUrlDeQueue();
            DownLoadFile downloader = new DownLoadFile();
            // Download the page
            downloader.downloadFile(visitURL);
            // Add the URL to the visited set
            LinkQueue.addVisitedUrl(visitURL);
            // Extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extractLinks(visitURL, filter);

            // Enqueue the new, unvisited URLs
            for (String link : links) {
                System.out.println("link:" + link);
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    public static void main(String args[]) {
        // Program entry point
        MyCrawler myCrawler = new MyCrawler();
        myCrawler.crawling(new String[]{"http://news.fudan.edu.cn/news/"});
    }
}
Utility class: Tools.java
package com.amos.tool;

import java.io.*;
import java.net.UnknownHostException;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLException;

import org.apache.http.*;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLContextBuilder;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.*;
import org.apache.http.protocol.HttpContext;

/**
 * Created by amosli on 14-6-25.
 */
public class Tools {

    /**
     * Write a response entity to a local file
     *
     * @param httpEntity
     * @param filename
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();

            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Write a byte array to a local file
     *
     * @param bytes
     * @param filename
     */
    public static void saveToLocalByBytes(byte[] bytes, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            fileOutputStream.write(bytes);
            //fileOutputStream.write(bytes, 0, bytes.length);
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Print to standard output
     * @param string
     */
    public static void println(String string) {
        System.out.println("string:" + string);
    }

    /**
     * Print to standard error
     * @param string
     */
    public static void printlnerr(String string) {
        System.err.println("string:" + string);
    }

    /**
     * Build an HttpClient that uses an SSL channel and a request retry handler
     * @return
     */
    public static CloseableHttpClient createSSLClientDefault() {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                // Trust every certificate
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();

            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Retry handler: a failed request is retried up to 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection refused
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    if (idempotent) {
                        // Retry if the request is considered idempotent
                        return true;
                    }
                    return false;
                }
            };

            // Request configuration: 20 s connection-request timeout, 20 s connect timeout, no circular redirects
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000).setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    // at most 25 connections per route, 256 in total
                    .setMaxConnPerRoute(25).setMaxConnTotal(256)
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig)
                    .build();

        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }

    /**
     * Same as createSSLClientDefault, but with a CookieStore attached
     * @param cookieStore
     * @return
     */
    public static CloseableHttpClient createSSLClientDefaultWithCookie(CookieStore cookieStore) {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                // Trust every certificate
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();

            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Retry handler: a failed request is retried up to 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection refused
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    if (idempotent) {
                        // Retry if the request is considered idempotent
                        return true;
                    }
                    return false;
                }
            };

            // Request configuration: 20 s connection-request timeout, 20 s connect timeout, no circular redirects
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000).setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    // at most 25 connections per route, 256 in total
                    .setMaxConnPerRoute(25).setMaxConnTotal(256)
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig)
                    .setDefaultCookieStore(cookieStore)
                    .build();

        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }

}
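If you want to sanity-check the client builder on its own, a minimal usage sketch might look like the following. The ToolsDemo class and the target URL are illustrative only and not part of the original project; EntityUtils is the standard helper from HttpCore:

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

import com.amos.tool.Tools;

public class ToolsDemo {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = Tools.createSSLClientDefault();
        HttpGet get = new HttpGet("http://news.fudan.edu.cn/news/");   // placeholder URL
        try {
            // Execute the request and print the first part of the response body
            String body = EntityUtils.toString(client.execute(get).getEntity(), "UTF-8");
            System.out.println(body.substring(0, Math.min(200, body.length())));
        } finally {
            client.close();
        }
    }
}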
Download class that writes pages to the local disk: DownLoadFile.java
package com.amos.crawl;

import com.amos.tool.Configuration;
import com.amos.tool.Tools;
import org.apache.http.*;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

import java.io.*;

/**
 * Created by amosli on 14-7-9.
 */
public class DownLoadFile {

    public String getFileNameByUrl(String url, String contentType) {
        // Strip the protocol prefix (http:// or https://)
        url = url.contains("http://") ? url.substring(7) : url.substring(8);

        // text/html content
        if (url.contains(".html")) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_");
        } else if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        } else {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
        return url;
    }

    /**
     * Write page data to a local file
     * @param data
     * @param filePath
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++) {
                out.write(data[i]);
            }
            out.flush();
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Write a response entity to a local file
     *
     * @param httpEntity
     * @param filename
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();

            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public String downloadFile(String url) {
        // Path of the saved file
        String filePath = null;

        // 1. Create the HttpClient and configure it
        HttpClient httpClient = Tools.createSSLClientDefault();

        // 2. Create the HttpGet request and configure it
        HttpGet httpGet = new HttpGet(url);

        // Set a 5 s connect timeout for the GET request
        // Option 1
        //httpGet.getParams().setParameter("connectTimeout", 5000);
        // Option 2
        RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(5000).build();
        httpGet.setConfig(requestConfig);

        try {
            HttpResponse httpResponse = httpClient.execute(httpGet);
            int statusCode = httpResponse.getStatusLine().getStatusCode();
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed:" + httpResponse.getStatusLine());
                return null;
            }

            filePath = getFileNameByUrl(url, httpResponse.getEntity().getContentType().getValue());
            saveToLocal(httpResponse.getEntity(), filePath);
        } catch (Exception e) {
            e.printStackTrace();
        }

        return filePath;
    }

    public static void main(String args[]) throws IOException {
        String url = "http://websearch.fudan.edu.cn/search_dep.html";
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse httpResponse = httpClient.execute(httpGet);
        Header contentType = httpResponse.getEntity().getContentType();

        System.out.println("name:" + contentType.getName() + "value:" + contentType.getValue());
        System.out.println(new DownLoadFile().getFileNameByUrl(url, contentType.getValue()));
    }

}
A filter interface: LinkFilter.java
package com.amos.crawl;

/**
 * Created by amosli on 14-7-10.
 */
public interface LinkFilter {

    public boolean accept(String url);

}
Extracting and filtering URLs with HtmlParser: HtmlParserTool.java
package com.amos.crawl;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class HtmlParserTool {
    public static Set<String> extractLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();

        try {
            Parser parser = new Parser(url);
            parser.setEncoding("GBK");
            // Filter for <frame> tags, used to extract the src attribute
            NodeFilter framFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    if (node.getText().contains("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };

            // OrFilter that matches both <a> tags and <frame> tags
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), framFilter);
            // Get all nodes that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {
                    String linkURL = ((LinkTag) tag).getLink();

                    // Add the URL if it passes the filter
                    if (filter.accept(linkURL)) {
                        links.add(linkURL);
                    }
                } else {
                    // <frame> tag: extract the link in the src attribute, e.g. <frame src="test.html" />
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);

                    int end = frame.indexOf(" ");
                    if (end == -1) {
                        end = frame.indexOf(">");
                    }
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl)) {
                        links.add(frameUrl);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        return links;
    }

}
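As a quick illustration of how extractLinks is used together with LinkFilter (the same pattern MyCrawler uses above), a hypothetical demo class in the same package might look like this:

package com.amos.crawl;

import java.util.Set;

public class HtmlParserToolDemo {
    public static void main(String[] args) {
        // Extract only links under news.fudan.edu.cn from the seed page
        Set<String> links = HtmlParserTool.extractLinks("http://news.fudan.edu.cn/news/",
                new LinkFilter() {
                    @Override
                    public boolean accept(String url) {
                        return url.startsWith("http://news.fudan.edu.cn");
                    }
                });
        for (String link : links) {
            System.out.println(link);
        }
    }
}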
A queue implementation for managing page URLs: Queue.java
package com.amos.crawl;

import java.util.LinkedList;

/**
 * Created by amosli on 14-7-9.
 */
public class Queue {

    // The queue is backed by a linked list
    private LinkedList queueList = new LinkedList();

    // Enqueue
    public void enQueue(Object object) {
        queueList.addLast(object);
    }

    // Dequeue
    public Object deQueue() {
        return queueList.removeFirst();
    }

    // Check whether the queue is empty
    public boolean isQueueEmpty() {
        return queueList.isEmpty();
    }

    // Check whether the queue contains the given element
    public boolean contains(Object object) {
        return queueList.contains(object);
    }

    // Check whether the queue is empty
    public boolean empty() {
        return queueList.isEmpty();
    }

}
Managing the visited and unvisited URL queues: LinkQueue.java
package com.amos.crawl;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-9.
 */
public class LinkQueue {
    // URLs that have already been visited
    private static Set visitedUrl = new HashSet();
    // URLs that have not been visited yet
    private static Queue unVisitedUrl = new Queue();

    // Get the unvisited URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    public static Set getVisitedUrl() {
        return visitedUrl;
    }

    // Add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue an unvisited URL
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    // Make sure each URL is visited only once: the URL must not be empty, must not have been
    // visited already, and, since it will be dequeued later, must not already be waiting in
    // the unvisited queue
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url) && !unVisitedUrl.contains(url))
            unVisitedUrl.enQueue(url);
    }

    // Number of URLs already visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Check whether the unvisited URL queue is empty
    public static boolean isUnvisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}
The crawl works like this: start from the given seed URLs ==> extract the URLs that pass the filter and add them to the queue ==> dequeue the URLs in order, visit each one and extract the qualifying URLs it contains ==> download the pages of the queued URLs. In other words, the crawl explores level by level, limited to at most 100 pages.
3. Screenshots