Scraping the Tencent recruitment (社招) site with Scrapy
Goal: for each posting, extract the position name, position type, link to the posting, number of openings, work location, and publish date.
I. Workflow for creating the Scrapy project
1) Create the project for crawling Tencent job postings: scrapy startproject tencent
2) Enter the project directory with cd tencent, then generate the spider: scrapy genspider tencentPosition hr.tencent.com
3) Open the project in PyCharm
4) Define the fields in items.py according to the requirements
5) Write the spider
6) Write the item pipeline
7) Configure settings.py
8) The project as opened in PyCharm (a typical layout is sketched below):
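A typical layout of the freshly generated project looks roughly like this (the exact files can differ slightly between Scrapy versions; main.py is the run script created in section II, step 5):

tencent/
    scrapy.cfg
    main.py
    tencent/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            tencentPosition.py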
II. Code for each file
1. tencentPosition.py
import scrapy
from tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    name = 'tencentPosition'
    allowed_domains = ['hr.tencent.com']
    offset = 0
    url = "https://hr.tencent.com/position.php?&start="
    start_urls = [url + str(offset) + '#a']

    def parse(self, response):
        # Every job posting is a table row with class "even" or "odd"
        position_lists = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for position in position_lists:
            item = TencentItem()
            position_name = position.xpath("./td[1]/a/text()").extract()[0]
            position_link = position.xpath("./td[1]/a/@href").extract()[0]
            position_type = position.xpath("./td[2]/text()").get()
            people_num = position.xpath("./td[3]/text()").extract()[0]
            work_address = position.xpath("./td[4]/text()").extract()[0]
            publish_time = position.xpath("./td[5]/text()").extract()[0]
            item["position_name"] = position_name
            item["position_link"] = position_link
            item["position_type"] = position_type
            item["people_num"] = people_num
            item["work_address"] = work_address
            item["publish_time"] = publish_time
            yield item

        # Next page: the footer span holds the total number of postings;
        # each page shows 10 rows, so advance the offset by 10 until the end.
        total_page = response.xpath('//div[@class="left"]/span/text()').extract()[0]
        print(total_page)
        if self.offset < int(total_page):
            self.offset += 10
            new_url = self.url + str(self.offset) + "#a"
            yield scrapy.Request(new_url, callback=self.parse)
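Note that extract()[0] raises an IndexError as soon as a cell is missing from a row. If the page layout varies, the extraction loop can be made more defensive with SelectorList.get() and a default value. This is only an optional sketch, not part of the original code; the field names are the same ones defined in items.py, and the rest of parse() stays unchanged:

        for position in position_lists:
            item = TencentItem()
            # get(default="") returns "" instead of raising when a node is missing
            item["position_name"] = position.xpath("./td[1]/a/text()").get(default="")
            item["position_link"] = position.xpath("./td[1]/a/@href").get(default="")
            item["position_type"] = position.xpath("./td[2]/text()").get(default="")
            item["people_num"] = position.xpath("./td[3]/text()").get(default="")
            item["work_address"] = position.xpath("./td[4]/text()").get(default="")
            item["publish_time"] = position.xpath("./td[5]/text()").get(default="")
            yield item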
2. items.py
import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    position_name = scrapy.Field()
    position_link = scrapy.Field()
    position_type = scrapy.Field()
    people_num = scrapy.Field()
    work_address = scrapy.Field()
    publish_time = scrapy.Field()
***** Remember: these field names must stay consistent with the keys assigned in tencentPosition.py (the TencentpositionSpider class), otherwise assigning to the item raises a KeyError.
3. pipelines.py
import json


class TencentPipeline(object):
    def __init__(self):
        print("=======start========")
        self.file = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        print("=====ing=======")
        dict_item = dict(item)  # convert the item to a plain dict
        json_text = json.dumps(dict_item, ensure_ascii=False) + "\n"
        self.file.write(json_text)
        return item

    def close_spider(self, spider):
        print("=======end===========")
        self.file.close()
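As an alternative (not part of the original project), the same one-JSON-object-per-line output can be produced with Scrapy's built-in JsonLinesItemExporter, which handles the serialization itself; the file has to be opened in binary mode in that case:

from scrapy.exporters import JsonLinesItemExporter


class TencentPipeline(object):
    def open_spider(self, spider):
        # the exporter writes bytes, so the file is opened in binary mode
        self.file = open("tencent.json", "wb")
        self.exporter = JsonLinesItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()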
4. settings.py
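The essential entry here is ITEM_PIPELINES: without it, the TencentPipeline above is never called and tencent.json stays empty. A minimal sketch of the relevant settings (the User-Agent string and the delay are illustrative values, not taken from the original post):

BOT_NAME = 'tencent'

SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'

# Send a browser-like User-Agent (the exact string is just an example)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Skip the robots.txt check for this crawl
ROBOTSTXT_OBEY = False

# Wait a moment between requests to be polite to the server
DOWNLOAD_DELAY = 1

# Enable the pipeline that writes tencent.json
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}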
5. Running the project:
1) Create a main.py in the project root directory
2) main.py
from scrapy import cmdline

cmdline.execute("scrapy crawl tencentPosition".split())
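Running main.py (for example with PyCharm's Run button) is equivalent to typing scrapy crawl tencentPosition in a terminal at the project root: cmdline.execute() simply hands the split command line over to Scrapy's own command-line runner.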
III. Running results:
If you're passing by, please give me a follow; I'll keep working hard!