Scraping a single page of job listings from Lagou with Python

A record of a few pitfalls I ran into. The code is still very much beginner-level...

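1. Lagou's anti-scraping measures are pretty good. A plain GET with requests only returned an "operation too frequent" message; following examples found online, switching to the POST method worked.

2. The response is in JSON format. I spent a long time trying to match it with regular expressions and kept getting empty lists; in the end, parsing it as JSON pulled the data out right away.

3. I'm not fluent with operations on lists, dicts and the like. I wanted neat, aligned output, but my formatting wasn't up to it, so I need to go back to the books.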
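
For points 2 and 3, a minimal sketch may help. It parses a small hand-written sample shaped like the response this post works with (`content`, then `positionResult`, then `result`) using the standard `json` module, and then prints left-aligned columns. The sample values and the helper names `extract_rows` and `format_rows` are made up for illustration; they are not real Lagou data or part of the original script.

```python
import json

# Hypothetical sample with the same nesting as the real response;
# every value here is invented for illustration.
sample_text = '''
{"content": {"positionResult": {"result": [
  {"companyShortName": "CompanyA", "salary": "10k-15k",
   "workYear": "1-3 years", "education": "Bachelor"},
  {"companyShortName": "CompanyB", "salary": "8k-12k",
   "workYear": "Any", "education": "College"}
]}}}
'''


def extract_rows(text):
    """Parse the JSON text and keep only the four fields of interest."""
    data = json.loads(text)
    jobs = data["content"]["positionResult"]["result"]
    return [
        (j["companyShortName"], j["salary"], j["workYear"], j["education"])
        for j in jobs
    ]


def format_rows(rows):
    """Left-align each column to the width of its longest entry."""
    widths = [max(len(str(cell)) for cell in col) for col in zip(*rows)]
    return [
        "  ".join(str(cell).ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    ]


rows = extract_rows(sample_text)
for line in format_rows(rows):
    print(line)
```

One caveat: `len` counts characters, not display width, so columns containing Chinese text may still drift in a terminal.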

# Scrape job postings from Lagou (keyword: 数据分析 "data analysis", city: 广州 Guangzhou)
import requests


def getHtmlText(url):
    # Browser-like headers sent with the request
    headers = {
        'Host': 'www.lagou.com',
        # 'Origin': 'https://www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city=%E5%B9%BF%E5%B7%9E',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Mobile Safari/537.36',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': 'None',
        'X-Requested-With': 'XMLHttpRequest',
    }
    # Form fields: first-page flag, search keyword, page number
    data = {'first': 'true', 'kd': '数据分析', 'pn': '1'}
    try:
        # A plain GET only returned the "operation too frequent" message, so POST is used
        r = requests.post(url, headers=headers, data=data, timeout=30)
        r.raise_for_status()
        return r
    except requests.RequestException:
        # Signal failure with None so the caller can skip parsing
        return None


def parsePage(info, html):
    if html is None:
        return info
    res_json = html.json()  # parse the JSON body
    result = res_json['content']['positionResult']['result']
    for job in result:
        row = [job['companyShortName'], job['salary'], job['workYear'], job['education']]
        info.append(row)  # fill the caller's list instead of replacing it
        print(row)
    return info


def main():
    jobinfo = []
    url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
    html = getHtmlText(url)
    parsePage(jobinfo, html)


main()
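
As point 1 mentions, a blocked request can come back with an "operation too frequent" message instead of job data, and in that case the `res_json['content']['positionResult']['result']` lookup above would raise `KeyError`. Here is a rough sketch of a more defensive lookup. `safe_result` is a hypothetical helper, and the shape of the blocked reply in the example is an assumption for illustration, not something taken from Lagou documentation.

```python
def safe_result(res_json):
    """Return the list of postings, or [] when the expected keys are missing."""
    node = res_json
    for key in ("content", "positionResult", "result"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


# Assumed shape of a blocked reply (illustrative only)
blocked = {"status": False, "msg": "operation too frequent"}
# Normal reply with the nesting used in the script above
ok = {"content": {"positionResult": {"result": [{"salary": "10k-15k"}]}}}

print(safe_result(blocked))  # []
print(safe_result(ok))       # [{'salary': '10k-15k'}]
```

Inside `parsePage`, this could replace the direct indexing, so that a blocked request yields an empty list rather than a traceback.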
    

The result is as follows:

[Image: Scraping a single page of job listings from Lagou with Python]