Scrapy: from scraping the data to writing it into the database (with a few handy tricks!)
Target URLs: https://www.cn357.com/notice_300 and https://www.cn357.com/notice_191
The site has no anti-scraping measures in place, so we can go at it directly.
Data to scrape:
Above is the vehicle list.
Next comes the vehicle detail page:
We want every vehicle's full detail fields plus its pictures.
First, set up the project:
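If you want to follow along from scratch, the project can be created like this (the project name shangchewang is an assumption, inferred from the ShangchewangItem and ShangchewangPipeline class names used later):

scrapy startproject shangchewang
cd shangchewang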
Next, define the data fields we need in items.py:
import scrapy


class ShangchewangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    myurl = scrapy.Field()
    mytime = scrapy.Field()
    # Announcement model
    announcement_model = scrapy.Field()
    # Announcement batch
    announcement_batch = scrapy.Field()
    # Brand
    brand = scrapy.Field()
    # Vehicle type
    car_type = scrapy.Field()
    # Rated load mass
    rated_quality = scrapy.Field()
    # Gross mass
    total_quality = scrapy.Field()
    # Curb mass
    Curing_quality = scrapy.Field()
    # Fuel type
    fuel_type = scrapy.Field()
    # Emission standard
    emission_standard = scrapy.Field()
    # Number of axles
    number_of_axes = scrapy.Field()
    # Wheelbase
    wheelbase = scrapy.Field()
    # Axle load
    axle_load = scrapy.Field()
    # Number of spring leaves
    number_of_spring = scrapy.Field()
    # Number of tires
    number_of_tire = scrapy.Field()
    # Tire specification
    standard_tire = scrapy.Field()
    # Approach/departure angle
    leave_angle = scrapy.Field()
    # Front/rear overhang
    QianHouXuan = scrapy.Field()
    # Front track
    before_tire_distance = scrapy.Field()
    # Rear track
    back_tire_distance = scrapy.Field()
    # Identification code
    identification_number = scrapy.Field()
    # Overall length
    car_lange = scrapy.Field()
    # Overall width
    car_width = scrapy.Field()
    # Overall height
    car_hight = scrapy.Field()
    # Cargo box length
    container_lang = scrapy.Field()
    # Cargo box width
    container_width = scrapy.Field()
    # Cargo box height
    container_hight = scrapy.Field()
    # Top speed
    highest_speed = scrapy.Field()
    # Rated passenger capacity
    rated_passenger = scrapy.Field()
    # Cab seating capacity
    cab_people_number = scrapy.Field()
    # Steering type
    turn_type = scrapy.Field()
    # Max permitted trailer mass
    hang_car_all_quality = scrapy.Field()
    # Load mass utilization coefficient
    modulus = scrapy.Field()
    # Max load on the semi-trailer fifth wheel
    must_quality = scrapy.Field()
    # Manufacturer name
    firm_name = scrapy.Field()
    # Manufacturer address
    firm_address = scrapy.Field()
    # Phone number
    TLE = scrapy.Field()
    # Fax number
    fax = scrapy.Field()
    # Postal code
    postal_code = scrapy.Field()
    # Chassis 1
    chassis_one = scrapy.Field()
    # Chassis 2
    chassis_tow = scrapy.Field()
    # Chassis 3
    chassis_thress = scrapy.Field()
    # Chassis 4
    chassis_four = scrapy.Field()
    # Engine model
    engine_model = scrapy.Field()
    # Engine manufacturer
    engine_firm = scrapy.Field()
    # Engine brand
    engine_brand = scrapy.Field()
    # Displacement
    displacement = scrapy.Field()
    # Power
    power = scrapy.Field()
    # Remarks
    remark = scrapy.Field()
    # Images
    img = scrapy.Field()
Some of these fields really are a mouthful and hard to name nicely, but Little Squirrel still tries to stick to good naming habits.
Now let's take a quick look at the target pages.
Click "next page" to see how the site implements pagination:
Page two is:
It looks like the site simply appends the page number to the URL. To verify that guess, try page three https://www.cn357.com/notice_300_3 and page one https://www.cn357.com/notice_300_1:
Many of you may wonder: page three already confirmed the guess, so why check page one too? This comes from experience: on many sites the first page does not follow the pagination pattern, so make it a habit to verify the first page's URL as well.
With the pagination rule figured out, check whether the detail pages all share the same layout, because that determines how we structure the code. Here they do match the layout shown above, so no extra screenshot, but whenever you analyze a target, take the time to study the structure of the data you are after.
Time to write the spider.
Create a spider with the command scrapy genspider <spider name> "<target URL>".
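For example (the spider name car_spider is just an illustration, not something fixed by the post; passing the bare domain keeps the generated allowed_domains sensible):

scrapy genspider car_spider cn357.com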
This is the little spider Scrapy generated for us. It is only a few lines, but it is still rather adorable. A couple of notes from Little Squirrel about it. First, the comment on the very first line (the coding declaration) really matters on Python 2, especially when the scraped content contains Chinese. Second, we normally do not need the following line at all; it restricts which sites the spider may visit, and strictly speaking it should hold bare domain names such as cn357.com rather than full URLs:
allowed_domains = ["https://www.cn357.com/notice_300"]
Third, start_urls is the starting point of the whole crawl: the little spiders set off by visiting every URL in start_urls. We want them to start from the vehicle list, i.e. from each page of the listing, so we override the start_requests method to change where they set off from:
def start_requests(self):
    # list pages: notice_191 has 14 pages, notice_300 has 85
    urls_191 = ['https://www.cn357.com/notice_191_%d' % d for d in range(1, 15)]
    urls_300 = ['https://www.cn357.com/notice_300_%d' % d for d in range(1, 86)]
    # crawling the notice_300 list here; switch to urls_191 (or combine the two lists) for the other section
    start_urls = urls_300
    for url in start_urls:
        yield scrapy.Request(url, callback=self.parse)
Next we pull each vehicle's URL out of the list page, using Scrapy's built-in XPath selectors; you can pick up the details of XPath syntax over at w3c.
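Before wiring an XPath expression into the spider, it helps to try it out in the Scrapy shell; for example, against the first list page used in this post:

scrapy shell "https://www.cn357.com/notice_300_1"
>>> response.xpath('//div[@class="gMain"]/table[1]//a/@href').extract()[:5]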
By the way, here is a little trick for getting rid of ads on a page. The screenshot below has an obvious ad; ads like this are everywhere on certain, ahem, less wholesome sites and ruin the experience for us gentlemen, so watch the magic.
Press F12 (or right-click the ad and choose Inspect) to open Chrome's developer tools; note that Ctrl+U only shows the page source.
You can see the ad is just an image: select its img tag in the Elements panel and hit Delete, and the ad is gone for good.
Back to the topic at hand. The vehicle entries only carry relative links rather than complete URLs, so we have to build the full URLs ourselves. Clicking through a few entries shows that each detail URL is simply https://www.cn357.com plus the link we extracted. On to the code:
def parse(self, response):
    links = response.xpath('//div[@class="gMain"]/table[1]//a/@href').extract()
    for link in links:
        url = 'https://www.cn357.com' + link
        yield scrapy.Request(url, callback=self.parse_content)
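A small aside: instead of prefixing the host by hand, Scrapy's response.urljoin resolves the relative link against the current page URL and also copes with links that are already absolute. An equivalent sketch:

def parse(self, response):
    for link in response.xpath('//div[@class="gMain"]/table[1]//a/@href').extract():
        yield scrapy.Request(response.urljoin(link), callback=self.parse_content)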
Then we map each labelled cell on the detail page to its field:
def parse_content(self, response):
    item = ShangchewangItem()
    item['myurl'] = response.url
    item['mytime'] = time.strftime('%Y-%m/%d %H:%M:%S', time.localtime(time.time()))
    item['announcement_model'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[2]/td[2]//text()').extract())
    item['announcement_batch'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[2]/td[4]/text()').extract())
    item['brand'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[3]/td[2]//text()').extract())
    item['car_type'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[3]/td[4]//text()').extract())
    item['rated_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[4]/td[2]//text()').extract())
    item['total_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[4]/td[4]//text()').extract())
    item['Curing_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[5]/td[2]//text()').extract())
    item['fuel_type'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[5]/td[4]//text()').extract())
    item['emission_standard'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[6]/td[2]//text()').extract())
    item['number_of_axes'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[6]/td[4]//text()').extract())
    item['wheelbase'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[7]/td[2]//text()').extract())
    item['axle_load'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[7]/td[4]//text()').extract())
    item['number_of_spring'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[8]/td[2]//text()').extract())
    item['number_of_tire'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[8]/td[4]//text()').extract())
    item['standard_tire'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[9]/td[2]//text()').extract())
    item['leave_angle'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[9]/td[4]//text()').extract())
    item['QianHouXuan'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[10]/td[2]//text()').extract())
    item['before_tire_distance'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[10]/td[4]//text()').extract())
    item['back_tire_distance'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[11]/td[2]//text()').extract())
    item['identification_number'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[11]/td[4]//text()').extract())
    item['car_lange'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[12]/td[2]//text()').extract())
    item['car_width'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[12]/td[4]//text()').extract())
    item['car_hight'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[13]/td[2]//text()').extract())
    item['container_lang'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[13]/td[4]//text()').extract())
    item['container_width'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[14]/td[2]//text()').extract())
    item['container_hight'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[14]/td[4]//text()').extract())
    item['highest_speed'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[15]/td[2]//text()').extract())
    item['rated_passenger'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[15]/td[4]//text()').extract())
    item['cab_people_number'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[16]/td[2]//text()').extract())
    item['turn_type'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[16]/td[4]//text()').extract())
    item['hang_car_all_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[17]/td[2]//text()').extract())
    item['modulus'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[17]/td[4]//text()').extract())
    item['must_quality'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[18]/td[2]//text()').extract())
    item['firm_name'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[18]/td[4]//text()').extract())
    item['firm_address'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[19]/td[2]//text()').extract())
    item['TLE'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[19]/td[4]//text()').extract())
    item['fax'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[20]/td[2]//text()').extract())
    item['postal_code'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[20]/td[4]//text()').extract())
    item['chassis_one'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[21]/td[2]//text()').extract())
    item['chassis_tow'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[21]/td[4]//text()').extract())
    item['chassis_thress'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[22]/td[2]//text()').extract())
    item['chassis_four'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[22]/td[4]//text()').extract())
    item['engine_model'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[1]/text()').extract())
    item['engine_firm'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[2]/text()').extract())
    item['engine_brand'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[3]/text()').extract())
    item['displacement'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[4]//text()').extract())
    item['power'] = ','.join(response.xpath('//td[@colspan="4"]/table//tr[2]/td[5]/text()').extract())
    item['remark'] = ''.join(response.xpath('//div[@class="gMain"]/table/tr[24]/td[2]//text()').extract()).strip()
    item['img'] = ';'.join(response.xpath('//ul[@class="clear"]//li//img/@src').extract())
    yield item
One small note: when importing the item class, most people import it by its full package path, but PyCharm often flags that as an error, so I prefer this form instead:
from ..items import ShangchewangItem
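For reference, the top of the spider file then needs roughly these lines (a sketch covering only what the spider code above already uses):

# -*- coding: utf-8 -*-
import time

import scrapy

from ..items import ShangchewangItem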
Next we handle the scraped data in pipelines.py. Before that, we should take a stroll through settings.py: set ROBOTSTXT_OBEY to False, and Little Squirrel usually sets the download delay to one second:
DOWNLOAD_DELAY = 1
Then enable ITEM_PIPELINES.
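Taken together, the relevant settings.py entries look roughly like this (the shangchewang.pipelines path assumes the project name guessed earlier; 300 is just the usual priority placeholder):

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1

ITEM_PIPELINES = {
    'shangchewang.pipelines.ShangchewangPipeline': 300,
}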
Since the data ultimately has to be written into the database, we encapsulate the SQL side inside the item itself by giving the item class a method:
def get_insert_sql(self):
    insert_sql = '''
insert into cn357(
myurl,
mytime,
announcement_model,
announcement_batch,
brand,
car_type,
rated_quality,
total_quality,
Curing_quality,
fuel_type,
emission_standard,
number_of_axes,
wheelbase,
axle_load,
number_of_spring,
number_of_tire,
standard_tire,
leave_angle,
QianHouXuan,
before_tire_distance,
back_tire_distance,
identification_number,
car_lange,
car_width,
car_hight,
container_lang,
container_width,
container_hight,
highest_speed,
rated_passenger,
cab_people_number,
turn_type,
hang_car_all_quality,
modulus,
must_quality,
firm_name,
firm_address,
TLE,
fax,
postal_code,
chassis_one,
chassis_tow,
chassis_thress,
chassis_four,
engine_model,
engine_firm,
engine_brand,
displacement,
power,
remark,
img
)VALUES
(
'%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s'
);
'''%(
self['myurl'],
self['mytime'],
self['announcement_model'],
self['announcement_batch'],
self['brand'],
self['car_type'],
self['rated_quality'],
self['total_quality'],
self['Curing_quality'],
self['fuel_type'],
self['emission_standard'],
self['number_of_axes'],
self['wheelbase'],
self['axle_load'],
self['number_of_spring'],
self['number_of_tire'],
self['standard_tire'],
self['leave_angle'],
self['QianHouXuan'],
self['before_tire_distance'],
self['back_tire_distance'],
self['identification_number'],
self['car_lange'],
self['car_width'],
self['car_hight'],
self['container_lang'],
self['container_width'],
self['container_hight'],
self['highest_speed'],
self['rated_passenger'],
self['cab_people_number'],
self['turn_type'],
self['hang_car_all_quality'],
self['modulus'],
self['must_quality'],
self['firm_name'],
self['firm_address'],
self['TLE'],
self['fax'],
self['postal_code'],
self['chassis_one'],
self['chassis_tow'],
self['chassis_thress'],
self['chassis_four'],
self['engine_model'],
self['engine_firm'],
self['engine_brand'],
self['displacement'],
self['power'],
self['remark'],
self['img']
    )
    return insert_sql
The code is a bit ugly, I admit; Little Squirrel is still thinking about how to tidy it up, so suggestions are very welcome.
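As a hedged aside (this is not what the post above does): building the statement with '%s' string formatting breaks as soon as a field contains a quote character, and it is unsafe in general. A sketch of a parameterized variant, assuming the table's column names match the item field names (which they do here), so that pymysql handles the quoting:

def get_insert_sql(self):
    # derive the column list straight from the item's declared fields
    columns = list(self.fields.keys())
    placeholders = ', '.join(['%s'] * len(columns))
    sql = 'INSERT INTO cn357 (%s) VALUES (%s)' % (', '.join(columns), placeholders)
    params = [self.get(col, '') for col in columns]
    return sql, params

The pipeline would then unpack both values and call self.cursor.execute(sql, params), letting the driver escape each value.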
First, the pipelines code:
import pymysql


class ShangchewangPipeline(object):
    def __init__(self):
        # open the MySQL connection once, when the pipeline is instantiated
        self.conn = pymysql.connect(host="127.0.0.1", user='**', passwd='**', db="**", charset='utf8')
        self.cursor = self.conn.cursor()

    def do_insert(self, item):
        # run the INSERT statement built by the item itself
        insert_sql = item.get_insert_sql()
        self.cursor.execute(insert_sql)
        self.conn.commit()

    def process_item(self, item, spider):
        self.do_insert(item)
        return item

    def close_spider(self, spider):
        # release the connection when the spider shuts down
        self.cursor.close()
        self.conn.close()
Set up the database yourself; do_insert simply executes the SQL we packaged inside the item. Finally, close_spider is called when the spider shuts down, which is exactly the moment to release our resources. To wrap up, a look at the data now sitting in the database:
The crawl has not finished yet, but there are already close to 5,000 rows. If speed becomes a concern, it is worth looking into asynchronous crawling or a distributed scrapy-redis setup.
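Before jumping to a distributed setup, note that a single Scrapy process can usually be pushed further just through its concurrency settings; the numbers below are only illustrative and of course trade politeness (the one-second delay above) for speed:

CONCURRENT_REQUESTS = 32             # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0.25                # a lower delay means more load on the target site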
Later on, Little Squirrel will also save the downloaded images locally and store each image's local path in the database.