Scrapy: from scraping the data to loading it into the database (with a few handy tricks!)

Target URLs: https://www.cn357.com/notice_300 and https://www.cn357.com/notice_191

The site has no anti-scraping measures in place, so let's dive straight in!

The data we need to scrape:

[Screenshot: the vehicle listing page]

That's the vehicle listing.

Next, the vehicle detail page:

[Screenshots: the vehicle detail page]

We'll be scraping the full details of every vehicle along with its pictures.

First, set up the project:

[Screenshot: creating the Scrapy project]
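If you want to follow along, the command is something like this (the project name is my guess, based on the ShangchewangItem / ShangchewangPipeline class names that appear later):

scrapy startproject shangchewang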

Next, define the fields we need in items.py:

import scrapy


class ShangchewangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    myurl = scrapy.Field()
    mytime = scrapy.Field()
    # 公告型号 (announcement model)
    announcement_model = scrapy.Field()
    # 公告批次 (announcement batch)
    announcement_batch = scrapy.Field()
    # 品牌 (brand)
    brand = scrapy.Field()
    # 类型 (vehicle type)
    car_type = scrapy.Field()
    # 额定质量 (rated load mass)
    rated_quality = scrapy.Field()
    # 总质量 (gross mass)
    total_quality = scrapy.Field()
    # 整备质量 (curb weight)
    Curing_quality = scrapy.Field()
    # 燃料种类 (fuel type)
    fuel_type = scrapy.Field()
    # 排放依据标准 (emission standard)
    emission_standard = scrapy.Field()
    # 轴数 (number of axles)
    number_of_axes = scrapy.Field()
    # 轴距 (wheelbase)
    wheelbase = scrapy.Field()
    # 轴荷 (axle load)
    axle_load = scrapy.Field()
    # 弹簧片数 (number of leaf springs)
    number_of_spring = scrapy.Field()
    # 轮胎数 (number of tires)
    number_of_tire = scrapy.Field()
    # 轮胎规格 (tire specification)
    standard_tire = scrapy.Field()
    # 接近离去角 (approach/departure angle)
    leave_angle = scrapy.Field()
    # 前悬后悬 (front/rear overhang)
    QianHouXuan = scrapy.Field()
    # 前轮距 (front track)
    before_tire_distance = scrapy.Field()
    # 后轮距 (rear track)
    back_tire_distance = scrapy.Field()
    # 识别代号 (identification code)
    identification_number = scrapy.Field()
    # 整车长 (overall length)
    car_lange = scrapy.Field()
    # 整车宽 (overall width)
    car_width = scrapy.Field()
    # 整车高 (overall height)
    car_hight = scrapy.Field()
    # 货厢长 (cargo box length)
    container_lang = scrapy.Field()
    # 货厢宽 (cargo box width)
    container_width = scrapy.Field()
    # 货厢高 (cargo box height)
    container_hight = scrapy.Field()
    # 最高车速 (top speed)
    highest_speed = scrapy.Field()
    # 额定载客 (rated passenger capacity)
    rated_passenger = scrapy.Field()
    # 驾驶室准乘人数 (cab seating capacity)
    cab_people_number = scrapy.Field()
    # 转向形式 (steering type)
    turn_type = scrapy.Field()
    # 准拖挂车总质量 (towable trailer gross mass)
    hang_car_all_quality = scrapy.Field()
    # 载质量利用系数 (load mass utilization coefficient)
    modulus = scrapy.Field()
    # 半挂车鞍座最大承载质量 (max load on the semi-trailer saddle)
    must_quality = scrapy.Field()
    # 企业名称 (manufacturer name)
    firm_name = scrapy.Field()
    # 企业地址 (manufacturer address)
    firm_address = scrapy.Field()
    # 电话号码 (telephone number)
    TLE = scrapy.Field()
    # 传真号码 (fax number)
    fax = scrapy.Field()
    # 邮政编码 (postal code)
    postal_code = scrapy.Field()
    # 底盘1 (chassis 1)
    chassis_one = scrapy.Field()
    # 底盘2 (chassis 2)
    chassis_tow = scrapy.Field()
    # 底盘3 (chassis 3)
    chassis_thress = scrapy.Field()
    # 底盘4 (chassis 4)
    chassis_four = scrapy.Field()
    # 发动机型号 (engine model)
    engine_model = scrapy.Field()
    # 发动机生产企业 (engine manufacturer)
    engine_firm = scrapy.Field()
    # 发动机商标 (engine brand)
    engine_brand = scrapy.Field()
    # 排量 (displacement)
    displacement = scrapy.Field()
    # 功率 (power)
    power = scrapy.Field()
    # 备注 (remarks)
    remark = scrapy.Field()
    # 图片 (images)
    img = scrapy.Field()
Some of these fields really are a mouthful and hard to name well, but Little Squirrel still tries to stick to good naming habits.

Next, let's briefly analyze the target pages.

Click "next page" and see how the site implements pagination:

Page 2 is:

[Screenshot: the URL of page 2]

It looks like the site paginates simply by appending the page number to the URL. To check this guess, let's try page 3, https://www.cn357.com/notice_300_3, and page 1, https://www.cn357.com/notice_300_1:

[Screenshots: pages 3 and 1 both load as expected]

Many readers may wonder: page 3 already confirmed our guess, so why check page 1 as well? This is a lesson from experience: on many sites the first page does not follow the pagination pattern, so make it a habit to verify the first page's URL too.

With the pagination pattern nailed down, the next question is whether every vehicle detail page has the same layout, because that decides how we structure the code. The detail pages here match the layout shown earlier, so Little Squirrel won't paste more screenshots; but whenever you analyze a new target, remember to study the structure of the data carefully.

Time to write the spider.

Create a spider with the command scrapy genspider <spider_name> "<target URL>".
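For this site that would be something along the lines of the following (the spider name cn357 is just an assumption; any name works):

scrapy genspider cn357 "https://www.cn357.com/notice_300"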

[Screenshot: the spider file generated by Scrapy]
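For reference, the skeleton that genspider produces looks roughly like this (a sketch based on Scrapy's default template, so treat the names and details as approximate):

# -*- coding: utf-8 -*-
import scrapy


class Cn357Spider(scrapy.Spider):
    name = 'cn357'
    allowed_domains = ['https://www.cn357.com/notice_300']
    start_urls = ['https://www.cn357.com/notice_300/']

    def parse(self, response):
        pass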

This is the little spider Scrapy created for us; it's only a few lines of code, but still rather adorable! A few remarks from Little Squirrel: first, the encoding comment on the first line matters a great deal on Python 2.x, especially when the scraped content contains Chinese; second, in most cases we don't need the following line at all, since it restricts which addresses the spider is allowed to crawl (strictly speaking it should contain a bare domain such as cn357.com rather than a full URL), so here we can simply drop it:

allowed_domains = ["https://www.cn357.com/notice_300"]

Third, start_urls is the starting point of the whole crawl: the spider visits every URL listed there. We want our little spiders to set off from the vehicle listing pages, i.e. the URL of every page of the list, so we override the start_requests method to change where they start:

    def start_requests(self):
        # Listing pages for the two notice categories. Only the notice_300
        # pages are crawled here; swap in urls_191 (or chain both lists) to
        # cover the other category as well.
        urls_191 = ['https://www.cn357.com/notice_191_%d' % d for d in range(1, 15)]
        urls_300 = ['https://www.cn357.com/notice_300_%d' % d for d in range(1, 86)]
        start_urls = urls_300
        for url in start_urls:
            yield scrapy.Request(url, callback=self.parse)

Next we pull each vehicle's URL out of the listing pages using Scrapy's built-in XPath selectors; if you need to brush up on XPath syntax, you can learn it over at W3C.

[Screenshot: the vehicle list HTML in the browser]

As an aside, here's a little trick for getting rid of ads on a page. In the screenshot below there's an obvious ad; ads like this are especially rampant on certain, ahem, naughty sites and ruin the experience for our many male comrades. Now watch the magic.

[Screenshot: a page with an obvious ad banner]

Press F12 (or Ctrl+Shift+I) to open Chrome's developer tools.

[Screenshot: the ad's img element selected in the developer tools]

You can see the ad is just an image: select its img tag, press Delete, and the ad is gone for good.

Back to the main topic. The listing only gives us relative links for each vehicle, not full URLs, so we have to build them ourselves. Clicking into a few entries shows that every detail URL is simply https://www.cn357.com plus the link we extracted. Here's the code:

    def parse(self, response):
        links = response.xpath('//div[@class="gMain"]/table[1]//a/@href').extract()
        for link in links:
            url = 'https://www.cn357.com' + link
            yield scrapy.Request(url, callback=self.parse_content)
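As a side note, Scrapy can also build the absolute URL for you via response.urljoin, so an equivalent loop body would look like this (just an alternative sketch, not what the code above does):

        for link in links:
            # response.urljoin resolves the relative href against the page URL
            yield scrapy.Request(response.urljoin(link), callback=self.parse_content)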

Then we locate the value behind each label on the detail page:

    def parse_content(self, response):
        item = ShangchewangItem()
        item['myurl'] = response.url
        item['mytime'] = time.strftime('%Y-%m/%d %H:%M:%S',time.localtime(time.time()))
        item['announcement_model'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[2]/td[2]//text()').extract())
        item['announcement_batch'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[2]/td[4]/text()').extract())
        item['brand'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[3]/td[2]//text()').extract())
        item['car_type'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[3]/td[4]//text()').extract())
        item['rated_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[4]/td[2]//text()').extract())
        item['total_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[4]/td[4]//text()').extract())
        item['Curing_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[5]/td[2]//text()').extract())
        item['fuel_type'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[5]/td[4]//text()').extract())
        item['emission_standard'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[6]/td[2]//text()').extract())
        item['number_of_axes'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[6]/td[4]//text()').extract())
        item['wheelbase'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[7]/td[2]//text()').extract())
        item['axle_load'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[7]/td[4]//text()').extract())
        item['number_of_spring'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[8]/td[2]//text()').extract())
        item['number_of_tire'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[8]/td[4]//text()').extract())
        item['standard_tire'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[9]/td[2]//text()').extract())
        item['leave_angle'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[9]/td[4]//text()').extract())
        item['QianHouXuan'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[10]/td[2]//text()').extract())
        item['before_tire_distance'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[10]/td[4]//text()').extract())
        item['back_tire_distance'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[11]/td[2]//text()').extract())
        item['identification_number'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[11]/td[4]//text()').extract())
        item['car_lange'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[12]/td[2]//text()').extract())
        item['car_width'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[12]/td[4]//text()').extract())
        item['car_hight'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[13]/td[2]//text()').extract())
        item['container_lang'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[13]/td[4]//text()').extract())
        item['container_width'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[14]/td[2]//text()').extract())
        item['container_hight'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[14]/td[4]//text()').extract())
        item['highest_speed'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[15]/td[2]//text()').extract())
        item['rated_passenger'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[15]/td[4]//text()').extract())
        item['cab_people_number'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[16]/td[2]//text()').extract())
        item['turn_type'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[16]/td[4]//text()').extract())
        item['hang_car_all_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[17]/td[2]//text()').extract())
        item['modulus'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[17]/td[4]//text()').extract())
        item['must_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[18]/td[2]//text()').extract())
        item['firm_name'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[18]/td[4]//text()').extract())
        item['firm_address'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[19]/td[2]//text()').extract())
        item['TLE'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[19]/td[4]//text()').extract())
        item['fax'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[20]/td[2]//text()').extract())
        item['postal_code'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[20]/td[4]//text()').extract())
        item['chassis_one'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[21]/td[2]//text()').extract())
        item['chassis_tow'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[21]/td[4]//text()').extract())
        item['chassis_thress'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[22]/td[2]//text()').extract())
        item['chassis_four'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[22]/td[4]//text()').extract())
        item['engine_model'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[1]/text()').extract())
        item['engine_firm'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[2]/text()').extract())
        item['engine_brand'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[3]/text()').extract())
        item['displacement'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[4]//text()').extract())
        item['power'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[5]/text()').extract())
        item['remark'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[24]/td[2]//text()').extract()).strip()
        item['img'] = ';'.join(response.xpath('//ul[@class="clear"]//li//img/@src').extract())
        yield item

One note: when importing the items module, people usually import it by its full package path, but PyCharm often flags that as unresolved, so I prefer this form:

from ..items import ShangchewangItem
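For completeness, the top of the spider file then needs roughly these imports (a sketch; time is used to fill the mytime field above):

import time
import scrapy
from ..items import ShangchewangItem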

Next we process the scraped data in pipelines.py. But before that, let's take a stroll through settings.py: set ROBOTSTXT_OBEY to False, and Little Squirrel usually sets the download delay to one second:

DOWNLOAD_DELAY = 1

Then enable ITEM_PIPELINES.
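Put together, the relevant settings.py entries would look something like this (the pipeline path assumes the project is named shangchewang):

# settings.py (relevant lines only)
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1          # be polite: roughly one request per second
ITEM_PIPELINES = {
    # module path assumes the project is named "shangchewang"
    'shangchewang.pipelines.ShangchewangPipeline': 300,
}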

Since the scraped data ultimately has to be written to the database, we keep the data side encapsulated in the item itself by adding a method to the item class:

    def get_insert_sql(self):
        insert_sql = '''
        insert into cn357(
        myurl,
        mytime,
        announcement_model,
        announcement_batch,
        brand,
        car_type,
        rated_quality,
        total_quality,
        Curing_quality,
        fuel_type,
        emission_standard,
        number_of_axes,
        wheelbase,
        axle_load,
        number_of_spring,
        number_of_tire,
        standard_tire,
        leave_angle,
        QianHouXuan,
        before_tire_distance,
        back_tire_distance,
        identification_number,
        car_lange,
        car_width,
        car_hight,
        container_lang,
        container_width,
        container_hight,
        highest_speed,
        rated_passenger,
        cab_people_number,
        turn_type,
        hang_car_all_quality,
        modulus,
        must_quality,
        firm_name,
        firm_address,
        TLE,
        fax,
        postal_code,
        chassis_one,
        chassis_tow,
        chassis_thress,
        chassis_four,
        engine_model,
        engine_firm,
        engine_brand,
        displacement,
        power,
        remark,
        img
        )VALUES
        (
        '%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s'
        );
        '''%(
            self['myurl'],
            self['mytime'],
            self['announcement_model'],
            self['announcement_batch'],
            self['brand'],
            self['car_type'],
            self['rated_quality'],
            self['total_quality'],
            self['Curing_quality'],
            self['fuel_type'],
            self['emission_standard'],
            self['number_of_axes'],
            self['wheelbase'],
            self['axle_load'],
            self['number_of_spring'],
            self['number_of_tire'],
            self['standard_tire'],
            self['leave_angle'],
            self['QianHouXuan'],
            self['before_tire_distance'],
            self['back_tire_distance'],
            self['identification_number'],
            self['car_lange'],
            self['car_width'],
            self['car_hight'],
            self['container_lang'],
            self['container_width'],
            self['container_hight'],
            self['highest_speed'],
            self['rated_passenger'],
            self['cab_people_number'],
            self['turn_type'],
            self['hang_car_all_quality'],
            self['modulus'],
            self['must_quality'],
            self['firm_name'],
            self['firm_address'],
            self['TLE'],
            self['fax'],
            self['postal_code'],
            self['chassis_one'],
            self['chassis_tow'],
            self['chassis_thress'],
            self['chassis_four'],
            self['engine_model'],
            self['engine_firm'],
            self['engine_brand'],
            self['displacement'],
            self['power'],
            self['remark'],
            self['img']
        )
        return insert_sql

The code is a bit ugly; Little Squirrel is still thinking about how to clean it up, so suggestions are very welcome.
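One possible cleanup, sketched here rather than taken from the project above: build the column list from the item's declared fields and hand the values to pymysql as query parameters, which also protects against values that contain quote characters:

    def get_insert_sql(self):
        # Sketch: derive the column list from the item's declared fields so
        # the SQL can never drift out of sync with items.py.
        columns = list(self.fields.keys())
        placeholders = ', '.join(['%s'] * len(columns))
        insert_sql = 'INSERT INTO cn357 (%s) VALUES (%s)' % (', '.join(columns), placeholders)
        # Missing fields default to an empty string.
        params = [self.get(col, '') for col in columns]
        return insert_sql, params

The pipeline would then call self.cursor.execute(insert_sql, params) instead of executing a pre-formatted string.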

Now the pipelines code:

import pymysql

class ShangchewangPipeline(object):

    def __init__(self):
        self.conn = pymysql.connect(host="127.0.0.1", user='**', passwd='**', db="**", charset='utf8')
        self.cursor = self.conn.cursor()

    def do_insert(self, item):
        insert_sql = item.get_insert_sql()
        self.cursor.execute(insert_sql)
        self.conn.commit()

    def process_item(self, item, spider):
        self.do_insert(item)
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
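The pipeline assumes a table named cn357 already exists. If you still need to create it, one quick way (again just a sketch of my own, storing every column as TEXT; the import path assumes the project is named shangchewang) is to generate the DDL from the item's fields:

from shangchewang.items import ShangchewangItem

# Build a "col TEXT" pair for every field declared on the item.
columns = ',\n  '.join('%s TEXT' % name for name in ShangchewangItem.fields)
ddl = ('CREATE TABLE IF NOT EXISTS cn357 (\n'
       '  id INT AUTO_INCREMENT PRIMARY KEY,\n'
       '  %s\n'
       ') DEFAULT CHARSET=utf8;' % columns)
print(ddl)  # paste the output into your MySQL client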

Configure the database connection for your own environment. do_insert executes the SQL we encapsulated in the item; close_spider is called when the spider shuts down, which is when we release the cursor and connection. Finally, here's what the data looks like in the database:

[Screenshot: rows in the cn357 table]

The crawl hasn't finished yet, but there are already close to 5,000 rows. If speed is a concern, you could try an asynchronous crawler or a distributed scrapy-redis setup.

Later on, Little Squirrel will also save the downloaded images locally and store their local paths in the database.