Scraping 智联招聘 (Zhilian Zhaopin) via its AJAX API

        Last time I went through the pitfalls of scraping 智联招聘 and ended up collecting the job listings by driving the site with Selenium. It turns out there is a simpler way: analyze the site's AJAX API and request the JSON data directly.

        Analyzing the page's AJAX API: type python into the search box, click search, open the browser developer tools, switch to the Network tab, and click XHR to filter out the AJAX requests.

Page 1:

[Screenshot: XHR request parameters for page 1]

This is the request for the first page of search results. It carries a lot of parameters, and just looking at them doesn't reveal which ones matter, so click through to the next page and keep analyzing.

[Screenshot: XHR request parameters for page 2]

These are the parameters after clicking to page 2. Compared with page 1, there is one new parameter, start, with a value of 60. Click to page 3 and check the parameters again.

[Screenshot: XHR request parameters for page 3]

Here start has become 120. It's easy to see that start is the zero-based offset of the first record shown on the page, while pageSize is the number of records per page, fixed at 60. So we can assemble the query string with urlencode and control which page we fetch by passing in the right start value.

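For example, a minimal sketch of building the URL for a given page (parameter values copied from the captured request; build_search_url is just an illustrative helper name, not part of the final code):

from urllib.parse import urlencode

def build_search_url(page, page_size=60):
    # start is the zero-based record offset: page 1 -> 0, page 2 -> 60, ...
    params = {
        "start": str((page - 1) * page_size),
        "pageSize": str(page_size),
        "cityId": "489",
        "kw": "python",
        "kt": "3",
    }
    return "https://fe-api.zhaopin.com/c/i/sou?" + urlencode(params)

print(build_search_url(2))  # ...?start=60&pageSize=60&cityId=489&kw=python&kt=3
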
Analyzing the Preview:

[Screenshot: Preview of the JSON response]

Looking at the Preview, the returned JSON has a data field, and data contains a results field. Expanding results shows that the job title, the salary, and the URL of the job detail page are all in there, so we can call response.json() to get the returned JSON and parse out the data we want.

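A quick sketch of pulling those fields out (reusing build_search_url from the sketch above; field names exactly as seen in the Preview, with a minimal User-Agent header standing in for the full headers used later):

import requests

headers = {"User-Agent": "Mozilla/5.0"}  # the full code below uses the captured headers
resp = requests.get(build_search_url(1), headers=headers, timeout=5)
for item in resp.json().get("data", {}).get("results", []):
    print(item.get("jobName"), item.get("salary"), item.get("positionURL"))
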
Code plan:

1. Analyze the page's AJAX API and work out which parameters it needs.
2. Control which pages to crawl via the start parameter and generate the list of URLs to fetch.
3. Request those URLs and parse out the job detail page URLs.
4. Request each detail page and parse out the job description.
5. Save everything to the database.

Full code:

import time
from datetime import datetime
import requests
import lxml.html
from urllib.parse import urlencode, urlparse
import random
from retrying import retry
import sqlite3

db = sqlite3.connect("zhilian.db")
cursor = db.cursor()
# Create the target table if it doesn't exist yet (schema inferred from the INSERT below)
cursor.execute("""CREATE TABLE IF NOT EXISTS Job(
    jobName TEXT, jobSalary TEXT, jobCity TEXT,
    jobEdu TEXT, jobCompany TEXT, jobDetail TEXT)""")

class Throttle:
    """
    Download throttle: waits between successive requests to the same domain.
    """
    def __init__(self, delay):
        self.domains = {}  # domain -> time of last access
        self.delay = delay

    def wait_url(self, url_str):
        domain_url = urlparse(url_str).netloc
        last_accessed = self.domains.get(domain_url)
        if self.delay > 0 and last_accessed is not None:
            sleep_interval = self.delay - (datetime.now() - last_accessed).total_seconds()
            if sleep_interval > 0:
                # add random jitter so the requests don't look too regular
                time.sleep(sleep_interval + round(random.uniform(1, 3), 1))

        self.domains[domain_url] = datetime.now()

User_Agent = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
]

# Request headers copied from the AJAX request in the browser's Network panel
headers = {
    "Host":"fe-api.zhaopin.com",
    "Origin":"https://sou.zhaopin.com",
    "Referer":"https://sou.zhaopin.com/?p=3&pageSize=60&jl=489&kw=python&kt=3",
    "User-Agent":"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
}

throttle = Throttle(3)

def get_page(start, page):
    """
    Build the parameters for the ajax request and call the API.
    :param start: pagination offset -- which record to start from
    :param page: the page number corresponding to that offset
    :return: the parsed JSON response, or None
    """
    params = {
        "start": str(start),    # record offset: 0, 60, 120, ...
        "pageSize": "60",       # fixed at 60 records per page
        "cityId": "489",
        "workExperience": "-1",
        "education": "-1",
        "companyType": "-1",
        "employmentType": "-1",
        "jobWelfareTag": "-1",
        "kw": "python",
        "kt": "3",
        "lastUrlQuery": {"p": str(page), "pageSize": "60", "jl": "489", "kw": "python", "kt": "3"}
    }

    url = "https://fe-api.zhaopin.com/c/i/sou?" + urlencode(params)
    result_json = download_retry(url)
    return result_json

@retry(stop_max_attempt_number=4)
def download_retry(url):
    """
    The actual download function; retried up to 4 times on failure.
    :param url: the URL to fetch
    :return: parsed JSON on success, None on a non-200 response
    """
    try:
        response = requests.get(url, headers=headers, timeout=5)
        if response.status_code == 200:
            result = response.json()
        else:
            result = None
    except requests.RequestException:
        # re-raise so that @retry triggers another attempt
        raise ConnectionError
    return result

def get_content(result_json):
    """Pull the fields we care about out of the returned JSON."""
    if result_json is not None:
        result_items = result_json.get("data").get("results")
        for item in result_items:
            yield {
                "job_title": item.get("jobName"),
                "job_city": item.get("city").get("display"),
                "job_salary": item.get("salary"),
                "job_edu": item.get("eduLevel").get("name"),
                "job_company": item.get("company").get("name"),
                "job_url": item.get("positionURL"),
            }

def download_detail(url):
    """Fetch a job detail page and extract the description text."""
    throttle.wait_url(url)
    response = requests.get(url, headers={"User-Agent": random.choice(User_Agent)}, timeout=5)
    html = lxml.html.fromstring(response.text)
    # the description lives in one of two layouts, so try both xpaths
    detail1 = html.xpath('//div[@class="tab-inner-cont"]/*/text()')
    detail2 = html.xpath('//div[@class="tab-inner-cont"]/p/span/text()')
    # xpath() returns a list; an empty list means no match, so test truthiness
    if detail1:
        job_detail = detail1
    elif detail2:
        job_detail = detail2
    else:
        job_detail = "暂无"  # "not available"
    return job_detail

if __name__ == '__main__':
    for page in range(1, 100):
        # page 1 starts at record 0, page 2 at record 60, and so on
        result = get_page((page - 1) * 60, page)
        for content in get_content(result):
            print("Downloading:", content["job_url"])
            job_detail = download_detail(content["job_url"])
            # join the text fragments and strip all whitespace
            content["job_detail"] = ''.join(''.join(job_detail).split())
            cursor.execute(
                "insert into Job(jobName,jobSalary,jobCity,jobEdu,jobCompany,jobDetail)"
                " values(?,?,?,?,?,?)",
                [content["job_title"], content["job_salary"], content["job_city"],
                 content["job_edu"], content["job_company"], content["job_detail"]])
            db.commit()

I added a download throttle here so we don't get our IP banned for crawling too fast.
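
Once the crawl has run, a quick sanity check on the saved rows might look like this (table and column names taken from the INSERT statement above):

import sqlite3

db = sqlite3.connect("zhilian.db")
for row in db.execute("SELECT jobName, jobSalary, jobCity FROM Job LIMIT 5"):
    print(row)
db.close()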