Building a Simple Web Crawler with Python

We start with the most basic crawler, which fetches the Baidu homepage.

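# Python 2 only: urllib2 was split into urllib.request and urllib.error in Python 3
import urllib2

def getHtml(url):
    req = urllib2.Request(url)        # build a Request object
    response = urllib2.urlopen(req)   # open the URL
    html = response.read()            # read the page body
    print html

def main():
    getHtml('http://www.baidu.com')

main()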

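However, this crawler makes no attempt to look like a person using a browser, so the website can tell that the request comes from a program. Some websites block crawlers.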
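
To see why a site can tell, it helps to look at the User-Agent the library announces when we do not set one. The following is a minimal sketch using Python 3's urllib.request (the successor of urllib2), not the Python 2 code above. The helper name default_user_agent is made up for illustration.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that urllib.request sends when none is set."""
    # A freshly built opener carries its default headers in `addheaders`,
    # a list of (name, value) tuples.
    opener = urllib.request.build_opener()
    return dict(opener.addheaders).get('User-agent')


# Prints a string that starts with "Python-urllib/", followed by the
# interpreter's major.minor version.
print(default_user_agent())
```

A server that inspects this header sees a value that no browser sends, which is enough for it to reject the request.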

For example:

http://zccx.tyb.njupt.edu.cn/

The previous program cannot fetch this page.

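If we visit a website in a browser and capture the traffic, we find that the request carries some headers that our script does not send. Websites generally use the User-Agent header to identify the type of client.

So we add the following to the request headers: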

User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)

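With this one header the crawler presents itself as a browser (here, Internet Explorer 8) instead of a script.

The modified program is as follows: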

# Python 2 only: urllib2 was split into urllib.request and urllib.error in Python 3
import urllib2

def getHtml(url):
    # Send a browser-like User-Agent so the site does not see the library default
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)'
    }
    req = urllib2.Request(url, headers=headers)  # build a Request object with the custom headers
    response = urllib2.urlopen(req)              # open the URL
    html = response.read()                       # read the page body
    print html

def main():
    getHtml('http://secure.verycd.com/signin/*/http://www.verycd.com/')

main()
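
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is an illustrative port under some assumptions, not the code from this post: the names BROWSER_UA, build_request and fetch_html, the 10-second timeout, and the error handling are all additions, and the User-Agent string is shortened.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent (a shortened form of the one used in this post).
BROWSER_UA = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)'


def build_request(url):
    """Build a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': BROWSER_UA})


def fetch_html(url, timeout=10):
    """Fetch a page and return its text, or None if the request fails."""
    req = build_request(url)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        print('HTTP error', err.code, 'for', url)
    except urllib.error.URLError as err:
        # The server could not be reached at all (DNS failure, refused connection).
        print('Could not reach', url, ':', err.reason)
    return None


# Confirm the header is attached before anything is sent over the network.
# Request stores header names capitalized, hence 'User-agent'.
print(build_request('http://www.baidu.com').get_header('User-agent'))
```

Calling fetch_html('http://www.baidu.com') then performs the actual request. Building the request in a separate function keeps the header logic checkable without network access.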
