Web Crawler
-
I. Technologies Used
This crawler is a small example I put together about half a month ago while learning crawling techniques. It is fairly simple, and since I am afraid I will forget it over time, here is a brief summary. The main external JARs are HttpClient 4.3.4 and HtmlParser 2.1, the IDE is IntelliJ IDEA 13.1, and dependencies are managed with Maven; if you are not comfortable with IntelliJ, you can also create the project in Eclipse.
II. Crawler Basics
1. What is a web crawler? (the basic principle)
Taking the term apart: the "web" is the internet, which is structured much like a spider's web, and the "crawler" is the spider that crawls all over it, collecting data and then processing it.
The encyclopedia definition: a web crawler (also called a web spider or web robot, and in the FOAF community more often a "web chaser") is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
Basic principle: a traditional crawler starts from the URLs of one or more seed pages and extracts the URLs found on them; while fetching pages, it keeps pulling new URLs from the current page into a queue until some stopping condition of the system is met, as shown in the flow chart. A focused crawler works in a more involved way: it uses a page-analysis algorithm to filter out links unrelated to the topic, keeps the useful links, and puts them into the queue of URLs waiting to be crawled. It then picks the next URL to fetch from the queue according to some search strategy and repeats the process until a stopping condition is reached.
2. What are the common crawl strategies?
Crawl strategies generally fall into three categories: depth-first, breadth-first, and best-first. Depth-first crawling often causes the crawler to become trapped, so breadth-first and best-first are the approaches commonly used today.
2.1 Breadth-First
Breadth-first traversal is a strategy for traversing a connected graph. It starts from a vertex V0 and fans out, visiting the broad region around it first, hence the name.
Its basic idea (a minimal traversal sketch follows the steps):
1) Start from some vertex V0 in the graph and visit it;
2) From V0, visit each of its not-yet-visited adjacent vertices W1, W2, ..., Wk; then, from W1, W2, ..., Wk in turn, visit their not-yet-visited adjacent vertices;
3) Repeat step 2 until every vertex has been visited, as shown in the figure below.
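To make the steps concrete, here is a minimal, self-contained breadth-first traversal sketch. It is illustration only and not part of the crawler code below; the adjacency map (vertex to list of neighbors) is a hypothetical representation chosen just for the example:

import java.util.*;

public class BfsDemo {
    // Breadth-first traversal of a graph given as an adjacency map (vertex -> neighbors)
    public static List<String> bfs(Map<String, List<String>> graph, String start) {
        List<String> order = new ArrayList<String>();
        Set<String> visited = new HashSet<String>();
        Deque<String> queue = new ArrayDeque<String>();
        queue.add(start);
        visited.add(start);
        while (!queue.isEmpty()) {
            String node = queue.poll();   // take the vertex at the head of the queue
            order.add(node);
            List<String> neighbors = graph.get(node);
            if (neighbors == null) {
                continue;
            }
            for (String next : neighbors) {
                if (visited.add(next)) {  // enqueue each neighbor that has not been visited yet
                    queue.add(next);
                }
            }
        }
        return order;                     // vertices in the order they were visited, level by level
    }
}

The queue guarantees that all vertices at distance n from the start are visited before any vertex at distance n+1, which is exactly the "level by level" behavior described above.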
2.2 Depth-First
Assume that initially no vertex in the graph has been visited. Depth-first search then proceeds as follows:
1) Pick some vertex Vi in the graph as the starting point, visit it and mark it;
2) With Vi as the current vertex, examine Vi's adjacent vertices Vj one by one; if Vj has not been visited, visit and mark it; if Vj has already been visited, move on to Vi's next adjacent vertex;
3) With Vj as the current vertex, repeat step 2 until every vertex reachable from Vi has been visited;
4) If unvisited vertices remain (the graph is not connected), pick any unvisited vertex as a new starting point and repeat the process until all vertices have been visited. The figures below use a directed graph and an undirected graph as examples.
The difference between breadth-first and depth-first:
Breadth-first traversal proceeds level by level: all nodes on one level are searched before moving down to the next level. Depth-first traversal instead follows a single branch until all nodes on it have been searched, and only then turns to the nodes on another branch.
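For comparison with the breadth-first sketch above, here is a minimal depth-first traversal sketch using an explicit stack. Again, it is illustration only; the adjacency-map representation is the same hypothetical stand-in as before:

import java.util.*;

public class DfsDemo {
    // Depth-first traversal of a graph given as an adjacency map (vertex -> neighbors)
    public static List<String> dfs(Map<String, List<String>> graph, String start) {
        List<String> order = new ArrayList<String>();
        Set<String> visited = new HashSet<String>();
        Deque<String> stack = new ArrayDeque<String>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (!visited.add(node)) {
                continue;                 // already reached through another branch
            }
            order.add(node);
            List<String> neighbors = graph.get(node);
            if (neighbors == null) {
                continue;
            }
            // push neighbors in reverse so the first neighbor is explored first,
            // driving the search down one branch before backtracking
            for (int i = neighbors.size() - 1; i >= 0; i--) {
                if (!visited.contains(neighbors.get(i))) {
                    stack.push(neighbors.get(i));
                }
            }
        }
        return order;
    }
}

Swapping the queue for a stack is the whole difference: the most recently discovered vertex is expanded next, so one branch is exhausted before the search backtracks.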
2.3 Best-First Search
A best-first strategy uses a page-analysis algorithm to predict how similar a candidate URL is to the target page, or how relevant it is to the topic, and fetches only the one or few URLs with the best scores. It visits only the pages that the analysis algorithm predicts to be "useful". This kind of search is well suited to crawling deep-web data, since it fetches only content that matches the requirements.
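A minimal sketch of the idea, using a priority queue ordered by a relevance score. The score() heuristic below is purely a placeholder for a real page-analysis or topic-relevance model, and the URLs are made up:

import java.util.*;

public class BestFirstDemo {
    // Stand-in relevance score; a real crawler would use a page-analysis model here
    static double score(String url) {
        return url.contains("news") ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // Frontier ordered so that the highest-scoring URL is fetched first
        PriorityQueue<String> frontier = new PriorityQueue<String>(16, new Comparator<String>() {
            public int compare(String a, String b) {
                return Double.compare(score(b), score(a));
            }
        });
        frontier.add("http://example.com/news/1.html");
        frontier.add("http://example.com/about.html");
        while (!frontier.isEmpty()) {
            String next = frontier.poll();   // most promising URL first
            System.out.println("fetch: " + next);
            // ...download the page, score the newly extracted URLs, add them to the frontier...
        }
    }
}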
3. Structure of the pages crawled in this article
The example in this article crawls news pages. On a news site, the important and most recent items are usually placed on the front page, and the importance of pages that sit deeper in the link structure generally decreases level by level, so a breadth-first strategy is the better fit. The figure below shows the structure of the pages this article will crawl:
III. A Breadth-First Crawler Example
1. Requirement: crawl Fudan University news (at most 100 pages)
Only 100 pages are crawled here, and every URL must start with http://news.fudan.edu.cn.
2. Implementation
Pull in the external JARs with Maven:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3.4</version>
</dependency>
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>2.1</version>
</dependency>
Main entry point of the program (MyCrawler.java):
package com.amos.crawl;

import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class MyCrawler {
    /**
     * Initialize the URL queue with the seed URLs
     *
     * @param seeds
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++) {
            LinkQueue.addUnvisitedUrl(seeds[i]);
        }
    }

    public void crawling(String[] seeds) {
        // Define a filter that keeps only links starting with http://news.fudan.edu.cn
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                if (url.startsWith("http://news.fudan.edu.cn")) {
                    return true;
                }
                return false;
            }
        };
        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);

        int count = 0;
        // Loop while there are still unvisited links and at most 100 pages have been crawled
        while (!LinkQueue.isUnvisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 100) {

            System.out.println("count:" + (++count));

            // Dequeue the URL at the head of the queue
            String visitURL = (String) LinkQueue.unVisitedUrlDeQueue();
            DownLoadFile downloader = new DownLoadFile();
            // Download the page
            downloader.downloadFile(visitURL);
            // Add the URL to the visited set
            LinkQueue.addVisitedUrl(visitURL);
            // Extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extractLinks(visitURL, filter);

            // Enqueue the new, unvisited URLs
            for (String link : links) {
                System.out.println("link:" + link);
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    public static void main(String args[]) {
        // Program entry point
        MyCrawler myCrawler = new MyCrawler();
        myCrawler.crawling(new String[]{"http://news.fudan.edu.cn/news/"});
    }
}
Utility class: Tools.java
package com.amos.tool;

import java.io.*;
import java.net.UnknownHostException;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLException;

import org.apache.http.*;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLContextBuilder;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.*;
import org.apache.http.protocol.HttpContext;

/**
 * Created by amosli on 14-6-25.
 */
public class Tools {

    /**
     * Write a response entity to a local file
     *
     * @param httpEntity
     * @param filename
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();

            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Write a byte array to a local file
     *
     * @param bytes
     * @param filename
     */
    public static void saveToLocalByBytes(byte[] bytes, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            fileOutputStream.write(bytes);
            //fileOutputStream.write(bytes, 0, bytes.length);
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Print to standard output
     * @param string
     */
    public static void println(String string) {
        System.out.println("string:" + string);
    }

    /**
     * Print to standard error
     * @param string
     */
    public static void printlnerr(String string) {
        System.err.println("string:" + string);
    }

    /**
     * Build an HttpClient that uses an SSL channel and a request retry handler
     * @return
     */
    public static CloseableHttpClient createSSLClientDefault() {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                // Trust every certificate
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();

            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Retry handler: a failed request is retried up to 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection refused
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    if (idempotent) {
                        // Retry if the request is considered idempotent
                        return true;
                    }
                    return false;
                }
            };

            // Request configuration: 20 s connection-request timeout, 20 s connect timeout, no circular redirects
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000).setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    // at most 25 connections per route, 256 in total
                    .setMaxConnPerRoute(25).setMaxConnTotal(256)
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig)
                    .build();

        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }

    /**
     * Same as createSSLClientDefault, but with a CookieStore attached
     * @param cookieStore
     * @return
     */
    public static CloseableHttpClient createSSLClientDefaultWithCookie(CookieStore cookieStore) {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                // Trust every certificate
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();

            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Retry handler: a failed request is retried up to 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection refused
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    if (idempotent) {
                        // Retry if the request is considered idempotent
                        return true;
                    }
                    return false;
                }
            };

            // Request configuration: 20 s connection-request timeout, 20 s connect timeout, no circular redirects
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000).setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    // at most 25 connections per route, 256 in total
                    .setMaxConnPerRoute(25).setMaxConnTotal(256)
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig)
                    .setDefaultCookieStore(cookieStore)
                    .build();

        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }

}
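If you want to sanity-check the client builder on its own, a minimal usage sketch might look like the following. The ToolsDemo class and the target URL are illustrative only and not part of the original project; EntityUtils is the standard helper from HttpCore:

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

import com.amos.tool.Tools;

public class ToolsDemo {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = Tools.createSSLClientDefault();
        HttpGet get = new HttpGet("http://news.fudan.edu.cn/news/");   // placeholder URL
        try {
            // Execute the request and print the first part of the response body
            String body = EntityUtils.toString(client.execute(get).getEntity(), "UTF-8");
            System.out.println(body.substring(0, Math.min(200, body.length())));
        } finally {
            client.close();
        }
    }
}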
Download class that writes pages to the local disk: DownLoadFile.java
package com.amos.crawl;

import com.amos.tool.Configuration;
import com.amos.tool.Tools;
import org.apache.http.*;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

import java.io.*;

/**
 * Created by amosli on 14-7-9.
 */
public class DownLoadFile {

    public String getFileNameByUrl(String url, String contentType) {
        // Strip the protocol prefix (http:// or https://)
        url = url.contains("http://") ? url.substring(7) : url.substring(8);

        // text/html content
        if (url.contains(".html")) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_");
        } else if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        } else {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
        return url;
    }

    /**
     * Write page data to a local file
     * @param data
     * @param filePath
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++) {
                out.write(data[i]);
            }
            out.flush();
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Write a response entity to a local file
     *
     * @param httpEntity
     * @param filename
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();

            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public String downloadFile(String url) {
        // Path of the saved file
        String filePath = null;

        // 1. Create the HttpClient and configure it
        HttpClient httpClient = Tools.createSSLClientDefault();

        // 2. Create the HttpGet request and configure it
        HttpGet httpGet = new HttpGet(url);

        // Set a 5 s connect timeout for the GET request
        // Option 1
        //httpGet.getParams().setParameter("connectTimeout", 5000);
        // Option 2
        RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(5000).build();
        httpGet.setConfig(requestConfig);

        try {
            HttpResponse httpResponse = httpClient.execute(httpGet);
            int statusCode = httpResponse.getStatusLine().getStatusCode();
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed:" + httpResponse.getStatusLine());
                return null;
            }

            filePath = getFileNameByUrl(url, httpResponse.getEntity().getContentType().getValue());
            saveToLocal(httpResponse.getEntity(), filePath);
        } catch (Exception e) {
            e.printStackTrace();
        }

        return filePath;
    }

    public static void main(String args[]) throws IOException {
        String url = "http://websearch.fudan.edu.cn/search_dep.html";
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse httpResponse = httpClient.execute(httpGet);
        Header contentType = httpResponse.getEntity().getContentType();

        System.out.println("name:" + contentType.getName() + "value:" + contentType.getValue());
        System.out.println(new DownLoadFile().getFileNameByUrl(url, contentType.getValue()));
    }

}
A filter interface: LinkFilter.java
package com.amos.crawl;

/**
 * Created by amosli on 14-7-10.
 */
public interface LinkFilter {

    public boolean accept(String url);

}
Extracting and filtering URLs with HtmlParser: HtmlParserTool.java
package com.amos.crawl;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class HtmlParserTool {
    public static Set<String> extractLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();

        try {
            Parser parser = new Parser(url);
            parser.setEncoding("GBK");
            // Filter for <frame> tags, used to extract the src attribute
            NodeFilter framFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    if (node.getText().contains("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };

            // OrFilter that matches both <a> tags and <frame> tags
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), framFilter);
            // Get all nodes that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {
                    String linkURL = ((LinkTag) tag).getLink();

                    // Add the URL if it passes the filter
                    if (filter.accept(linkURL)) {
                        links.add(linkURL);
                    }
                } else {
                    // <frame> tag: extract the link in the src attribute, e.g. <frame src="test.html" />
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);

                    int end = frame.indexOf(" ");
                    if (end == -1) {
                        end = frame.indexOf(">");
                    }
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl)) {
                        links.add(frameUrl);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        return links;
    }

}
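As a quick illustration of how extractLinks is used together with LinkFilter (the same pattern MyCrawler uses above), a hypothetical demo class in the same package might look like this:

package com.amos.crawl;

import java.util.Set;

public class HtmlParserToolDemo {
    public static void main(String[] args) {
        // Extract only links under news.fudan.edu.cn from the seed page
        Set<String> links = HtmlParserTool.extractLinks("http://news.fudan.edu.cn/news/",
                new LinkFilter() {
                    @Override
                    public boolean accept(String url) {
                        return url.startsWith("http://news.fudan.edu.cn");
                    }
                });
        for (String link : links) {
            System.out.println(link);
        }
    }
}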
A queue implementation for managing page URLs: Queue.java
package com.amos.crawl;

import java.util.LinkedList;

/**
 * Created by amosli on 14-7-9.
 */
public class Queue {

    // The queue is backed by a linked list
    private LinkedList queueList = new LinkedList();

    // Enqueue
    public void enQueue(Object object) {
        queueList.addLast(object);
    }

    // Dequeue
    public Object deQueue() {
        return queueList.removeFirst();
    }

    // Check whether the queue is empty
    public boolean isQueueEmpty() {
        return queueList.isEmpty();
    }

    // Check whether the queue contains the given element
    public boolean contains(Object object) {
        return queueList.contains(object);
    }

    // Check whether the queue is empty
    public boolean empty() {
        return queueList.isEmpty();
    }

}
Managing the visited and unvisited URL queues: LinkQueue.java
package com.amos.crawl;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-9.
 */
public class LinkQueue {
    // URLs that have already been visited
    private static Set visitedUrl = new HashSet();
    // URLs that have not been visited yet
    private static Queue unVisitedUrl = new Queue();

    // Get the unvisited URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    public static Set getVisitedUrl() {
        return visitedUrl;
    }

    // Add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue an unvisited URL
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    // Make sure each URL is visited only once: the URL must not be empty, must not have been
    // visited already, and, since it will be dequeued later, must not already be waiting in
    // the unvisited queue
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url) && !unVisitedUrl.contains(url))
            unVisitedUrl.enQueue(url);
    }

    // Number of URLs already visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Check whether the unvisited URL queue is empty
    public static boolean isUnvisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}
The crawl works like this: start from the given seed URLs ==> extract the URLs that pass the filter and add them to the queue ==> dequeue the URLs in order, visit each one and extract the qualifying URLs it contains ==> download the pages of the queued URLs. In other words, the crawl explores level by level, limited to at most 100 pages.
3. Screenshots