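Scrapy crawlers

Crawlers seem very popular right now: searching for Python content keeps turning up crawler articles. I also find crawlers interesting, so I spent some time learning them. This post briefly records the environment setup, how a crawl works, a static page example, crawling dynamically loaded pages, and storing data in a database. Scraping with the Scrapy framework is concise and fairly easy to learn, so Scrapy is what I use here.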

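1. Environment setup

(1) Install Python; 3.4, 3.5, or 3.6 all work.
(2) Install Scrapy (pip install scrapy).
(3) Install Twisted. pip usually works; if it fails, download the .whl file and install from that.
(4) Install pypiwin32 (pip install pypiwin32).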
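
A quick way to confirm that the packages from the steps above can be imported is a short check like the one below. This is only a minimal sketch: the helper name `check_packages` is made up for illustration, and `win32api` (the module that pypiwin32 provides) only matters on Windows.

```python
import importlib.util
import sys


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_packages(["scrapy", "twisted", "win32api"]).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```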

2. How a crawl works

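Scrapy components:

Engine (Scrapy Engine)
Handles the data flow of the whole system and triggers events (the core of the framework).
Scheduler
Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to fetch): it decides which URL is fetched next and also removes duplicate URLs.
Downloader
Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model).
Spiders
Spiders do the main work: they extract the information you need from specific pages, the so-called items (Item). They can also extract links so that Scrapy goes on to crawl the next page.
Item Pipeline
Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unwanted information. After a page is parsed by a spider, the items are sent to the item pipeline and processed through several steps in a fixed order.
Downloader Middlewares
A layer between the Scrapy engine and the downloader, mainly handling the requests and responses that pass between them.
Spider Middlewares
A layer between the Scrapy engine and the spiders, mainly handling the spiders' response input and request output.
Scheduler Middlewares
A layer between the Scrapy engine and the scheduler, handling the requests and responses sent from the engine to the scheduler.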
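
To make the scheduler description above more concrete, here is a toy priority queue that also drops duplicate URLs. It is a sketch of the idea only, not Scrapy's actual scheduler; the class name `ToyScheduler` and its methods are hypothetical.

```python
import heapq
import itertools


class ToyScheduler:
    """Priority queue of URLs that ignores URLs it has already seen."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker: keeps insertion order

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop the highest first
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def next_url(self):
        """Pop the next URL to fetch, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("https://hr.tencent.com/position.php?&start=0")
scheduler.enqueue("https://hr.tencent.com/position.php?&start=10", priority=1)
scheduler.enqueue("https://hr.tencent.com/position.php?&start=0")  # duplicate, ignored
print(scheduler.next_url())  # the start=10 URL, because of its higher priority
```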

3. Crawler example:

Crawl the Tencent recruitment site https://hr.tencent.com/position.php?&start=0

(1) Create a Scrapy project

scrapy startproject Traintencent

(2) Create a spider

scrapy genspider tencent "tencent.com" (tencent is the spider name, tencent.com is the site to crawl)

(3) Write the spider

Define the item fields:

import scrapy
class TraintencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    positionname = scrapy.Field()  # position name
    positionlink = scrapy.Field()  # link to the position details
    positiontype = scrapy.Field()  # position category
    peoplenumber = scrapy.Field()  # number of openings
    worklocation = scrapy.Field()  # work location
    publishtime = scrapy.Field()  # publish date

Write the spider:

# -*- coding: utf-8 -*-
import scrapy
from Traintencent.items import TraintencentItem  # item class defined in items.py


class TencentSpider(scrapy.Spider):
    name = 'tencent'  # spider name
    allowed_domains = ['tencent.com']  # restrict the crawl to this domain
    start_urls = ['https://hr.tencent.com/position.php?&start=0']  # first page to crawl
    base = 'https://hr.tencent.com/'
    count = 0  # number of pages parsed so far

    def parse(self, response):
        # each job posting is a table row with class "even" or "odd"
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for node in node_list:
            item = TraintencentItem()
            # extract() returns a list; some rows have no position name, and
            # indexing an empty list would raise IndexError, so fall back to "NULL"
            if node.xpath("./td[1]/a/text()").extract():
                item['positionname'] = node.xpath("./td[1]/a/text()").extract()[0]
            else:
                item['positionname'] = "NULL"
            item['positionlink'] = node.xpath("./td[1]/a/@href").extract()[0]
            if node.xpath("./td[2]/text()").extract():
                item['positiontype'] = node.xpath("./td[2]/text()").extract()[0]
            else:
                item['positiontype'] = "NULL"
            item['peoplenumber'] = node.xpath("./td[3]/text()").extract()[0]
            item['worklocation'] = node.xpath("./td[4]/text()").extract()[0]
            item['publishtime'] = node.xpath("./td[5]/text()").extract()[0]
            yield item
        self.count = self.count + 1
        # stop after more than 5 pages, or when the "next" link is disabled (last page)
        if self.count > 5 or response.xpath('//a[@class="noactive" and @id="next"]').extract():
            return
        else:
            url = response.xpath('//a[@id="next"]/@href').extract()[0]  # relative link to the next page
            url = self.base + url
            yield scrapy.Request(url, callback=self.parse)  # call parse again on the next page
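
The stop-or-continue decision at the end of parse can be looked at in isolation. The sketch below uses only the standard library; the function name `next_page_url` and the sample href values are hypothetical, and `urljoin` is used here as one way to turn a relative link into an absolute one.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href, pages_seen, max_pages=5):
    """Return the absolute URL of the next page, or None when crawling should stop."""
    if pages_seen > max_pages or not next_href:
        return None
    # urljoin resolves a relative href against the page it came from
    return urljoin(current_url, next_href)


start = "https://hr.tencent.com/position.php?&start=0"
print(next_page_url(start, "position.php?&start=10", pages_seen=1))
print(next_page_url(start, "position.php?&start=60", pages_seen=6))  # None: page limit reached
```

Inside a spider, `response.urljoin(href)` does the same resolution against the URL of the current response.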

Write the pipeline file to save the scraped data:

import json

class TraintencentPipeline(object):
    def __init__(self):
        # the output file is opened when the pipeline is created
        self.f = open("tencent.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        # serialize each item as a JSON object on its own line, followed by a comma;
        # ensure_ascii=False keeps Chinese text readable in the file
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
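
Because every record is written with a trailing comma, the resulting file is not one valid JSON document. One common alternative, shown here only as a sketch, is the JSON Lines layout: one JSON object per line with no separators, which can be read back line by line. The helper names and the sample records are made up for illustration, and an in-memory buffer stands in for the tencent.json file.

```python
import io
import json


def write_item(stream, item):
    """Append one item as a single JSON object on its own line."""
    stream.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_items(stream):
    """Parse a JSON Lines stream back into a list of dicts."""
    return [json.loads(line) for line in stream if line.strip()]


buffer = io.StringIO()  # stands in for the tencent.json file
write_item(buffer, {"positionname": "后台开发工程师", "worklocation": "深圳"})
write_item(buffer, {"positionname": "NULL", "worklocation": "北京"})
buffer.seek(0)
print(read_items(buffer))
```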

Edit the settings file:

Enable the pipeline
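
Enabling the pipeline means registering it under ITEM_PIPELINES in settings.py. A minimal sketch, assuming the project package is named Traintencent as the spider's import suggests:

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # dotted path to the pipeline class, mapped to an integer order (0 to 1000);
    # pipelines with lower numbers run first
    "Traintencent.pipelines.TraintencentPipeline": 300,
}
```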

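4. Storing scraped content in a MySQL database:

The spider is written exactly as before; only the pipelines file needs to change so that the scraped data is stored in the database: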

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from scrapy.utils.project import get_project_settings
class MySQLPipeline(object):
    def __init__(self):
        self.id = 1

    def connect_db(self):
        # Read the database connection details from settings.py
        settings = get_project_settings()

        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']  # character encoding

        # Connect to the database
        self.conn = pymysql.connect(
            host = self.host,
            port = self.port,
            user = self.user,
            password = self.password,
            db = self.name,  # database name
            charset = self.charset,
        )

        # Cursor object that provides the database operations
        self.cursor = self.conn.cursor()

    # Open the database connection when the spider starts
    def open_spider(self, spider):
        self.connect_db()

    # Close the database connection when the spider finishes
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    # Write the item to the database
    def process_item(self, item, spider):
        # Adjust the columns and values here to match your own table
        tempt = dict(item)
        self.id = self.id + 1
        # Let the driver fill the %s placeholders so that values are escaped properly
        sql = "insert into tencentwork values(%s,%s,%s,%s,%s,%s,%s);"
        params = (str(self.id), tempt['positionname'], tempt['positiontype'], tempt['peoplenumber'],
                  tempt['worklocation'], tempt['publishtime'], tempt['positionlink'])
        # Execute the SQL statement
        self.cursor.execute(sql, params)
        # Commit explicitly; otherwise the insert is rolled back and the table stays empty
        self.conn.commit()
        return item
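
Building the INSERT statement by pasting values into the SQL string breaks as soon as a value contains a quote, and it is open to SQL injection. The sketch below shows the parameterized form, using the standard library's sqlite3 as a stand-in so it runs without a MySQL server. The table schema and the sample item are assumptions, since the article does not show how tencentwork was created. With pymysql the placeholder is `%s` rather than `?`, and the values are likewise passed as the second argument of `cursor.execute`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, stand-in for MySQL
cursor = conn.cursor()
# Hypothetical schema with the seven columns the INSERT expects
cursor.execute(
    "CREATE TABLE tencentwork (id INTEGER, positionname TEXT, positiontype TEXT,"
    " peoplenumber TEXT, worklocation TEXT, publishtime TEXT, positionlink TEXT)"
)

# Made-up item; the single quote would break a string-formatted SQL statement
item = {
    "positionname": "O'Reilly 项目经理",
    "positiontype": "技术类",
    "peoplenumber": "2",
    "worklocation": "深圳",
    "publishtime": "2018-05-01",
    "positionlink": "position_detail.php?id=1",
}
params = (1, item["positionname"], item["positiontype"], item["peoplenumber"],
          item["worklocation"], item["publishtime"], item["positionlink"])
# The driver substitutes the placeholders and handles quoting
cursor.execute("INSERT INTO tencentwork VALUES (?, ?, ?, ?, ?, ?, ?)", params)
conn.commit()

rows = cursor.execute("SELECT id, positionname FROM tencentwork").fetchall()
print(rows)
```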

Configure settings:

ITEM_PIPELINES = {
   # 'Traintencent.pipelines.TraintencentPipeline': 300,
   'Traintencent.pipelines.MySQLPipeline': 200,  # lower number means it runs earlier
}

DB_HOST = 'localhost'
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = 'hexiong'
DB_NAME = 'hexiong'
DB_CHARSET = 'utf8'

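5. Crawling dynamic pages

A dynamic page here means one where, to speed up loading, much of the content is generated by JavaScript. Scrapy has no JavaScript engine, so it only retrieves the static HTML and cannot see content that JavaScript generates. A third-party middleware can be used to provide a JavaScript rendering service for such pages.

Splash is a JavaScript rendering service. It is a lightweight browser with an HTTP API, implemented in Python on top of Twisted and QT. Twisted and QT give the service asynchronous processing so that it can make use of WebKit's concurrency.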

(1) Environment setup

1) Install Docker Toolbox (download): http://get.daocloud.io/#install-docker-for-mac-windows

2) Open Docker Quickstart Terminal; it installs some components automatically

3) docker pull scrapinghub/splash (pulling the image downloads a large file; the default overseas registry is slow, so you can configure a domestic mirror. For details see https://blog.****.net/qq_38003892/article/details/80191572)

4) Start Splash with: docker run -p 8050:8050 scrapinghub/splash

5) Open this address in a browser: 192.168.99.100:8050

6) Install scrapy_splash (pip install scrapy_splash)
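
Once the container is running, Splash can also be queried directly over its HTTP API, which is a handy way to check the setup before involving Scrapy. The sketch below only builds a request URL for the `render.html` endpoint with its `url` and `wait` arguments. The helper name `splash_render_url` is hypothetical, and the Splash address and target page are the ones used in this article's settings and spider.

```python
from urllib.parse import urlencode


def splash_render_url(splash_base, page_url, wait=1):
    """Build a render.html URL asking Splash to load page_url and wait before returning HTML."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


print(splash_render_url("http://192.168.99.100:8050", "http://39.104.87.35/findex/"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the page's HTML after JavaScript has run, provided the Splash container is reachable at that address.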

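(2) Crawl example:

Write the spider

SplashRequest sends the page through Splash for rendering, so everything on the page is fetched, including the parts loaded dynamically by JavaScript, and the dynamic content can be read from the response.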

import scrapy
from scrapy_splash import SplashRequest
class DynamicWebspider(scrapy.Spider):
    name = "dynamicweb"
    start_urls = ["http://39.104.87.35/findex/"]
    base = "http://39.104.87.35"
    def start_requests(self):
        yield SplashRequest(url=self.start_urls[0], callback=self.parse, args={'wait': 1})

    def parse(self, response):
        print(response.xpath('//ul[@id="updatelist"]/li//a/text()').extract())

Edit the settings file

SPIDER_MIDDLEWARES = {
   'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# URL of the rendering service
SPLASH_URL = 'http://192.168.99.100:8050'

DOWNLOADER_MIDDLEWARES = {
   'scrapy_splash.SplashCookiesMiddleware': 723,
   'scrapy_splash.SplashMiddleware': 725,
   'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

While learning, I referred to several well-written blog posts:

https://blog.****.net/qq_38003892/article/details/80284160

https://blog.****.net/qq_38003892/article/details/80191572

https://www.cnblogs.com/shaosks/p/6950358.html
