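Scraping a single page of job postings from Lagou with Python
A note on a few pitfalls I ran into along the way. The code is still pretty beginner-level...
1. Lagou's anti-scraping measures are decent. A requests.get call only returned a "too many requests" (操作频繁) message; following examples I found online, switching to a POST request worked.
2. The response is JSON. I spent a long time trying to match it with regular expressions and kept getting empty lists, then found that decoding it as JSON extracts the data directly.
3. I'm not yet comfortable working with lists, dicts and the like. I wanted neat, aligned output, but my formatting didn't work out, so I need to go back to the books.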
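Pitfalls 2 and 3 can be tried out without touching the network. The sketch below is not the code from this post: it decodes a made-up JSON string with the standard json module and then pads each column to a common width. Only the key names (content, positionResult, result and the four fields) match what the script reads; the sample values and the names SAMPLE, FIELDS, extract_rows and format_table are hypothetical and exist only for illustration.

```python
import json

# Made-up sample payload: only the key names mirror the script in this post.
SAMPLE = """
{"content": {"positionResult": {"result": [
  {"companyShortName": "ExampleCo", "salary": "10k-15k",
   "workYear": "1-3 years", "education": "Bachelor"},
  {"companyShortName": "DemoTech", "salary": "8k-12k",
   "workYear": "Any", "education": "College"}
]}}}
"""

FIELDS = ("companyShortName", "salary", "workYear", "education")


def extract_rows(text):
    """Decode the JSON text and pull the four fields out of every posting."""
    payload = json.loads(text)
    postings = payload["content"]["positionResult"]["result"]
    return [[str(p.get(f, "")) for f in FIELDS] for p in postings]


def format_table(rows, headers=FIELDS):
    """Left-align every column to the width of its longest cell."""
    table = [list(headers)] + rows
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    )


rows = extract_rows(SAMPLE)
print(format_table(rows))
```

One caveat: str.ljust counts characters, not display columns, so Chinese values from a real response may still look slightly misaligned in a terminal where those characters are drawn double-width.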
# Scrape job postings from Lagou (search keyword: 数据分析, "data analysis")
import requests

def getHtmlText(url):
    """POST the search form to Lagou's Ajax endpoint; return the Response, or None on failure."""
    headers = {
        'Host': 'www.lagou.com',
        # 'Origin': 'https://www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city=%E5%B9%BF%E5%B7%9E',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Mobile Safari/537.36',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': 'None',
        'X-Requested-With': 'XMLHttpRequest',
    }
    # kd is the search keyword, pn the page number
    data = {'first': 'true', 'kd': '数据分析', 'pn': '1'}
    try:
        # A plain requests.get(url, timeout=30) only returned the "too frequent" message
        r = requests.post(url, headers=headers, data=data, timeout=30)
        r.raise_for_status()
        # print(r.text)
        return r
    except requests.RequestException:
        return None

def parsePage(info, html):
    """Append [company, salary, experience, education] for each posting to info."""
    res_json = html.json()  # decode the JSON response body
    try:
        result = res_json['content']['positionResult']['result']
    except (KeyError, TypeError):
        # e.g. a "too frequent" reply, which may not contain these keys
        print('Unexpected response:', res_json)
        return info
    # Fill the list passed in by the caller instead of rebinding it to a new one
    for job in result:
        row = [job['companyShortName'], job['salary'], job['workYear'], job['education']]
        info.append(row)
        print(row)
    return info

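def main():
    jobinfo = []
    url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
    html = getHtmlText(url)
    if not html:  # the request failed, so there is nothing to parse
        print('Request failed')
        return
    parsePage(jobinfo, html)

if __name__ == '__main__':
    main()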
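
The percent-encoded pieces in the Referer header and the request URL are simply the UTF-8 encodings of 数据分析 and 广州. As a minimal sketch, assuming the URL layout stays exactly as in the script above, they could be generated with urllib.parse instead of being pasted in by hand. The helper names build_ajax_url, build_referer and build_form_data are hypothetical, and whether Lagou accepts other keyword and city values in the same way is an assumption that has not been verified here.

```python
from urllib.parse import quote, urlencode

BASE = "https://www.lagou.com/jobs/"


def build_ajax_url(city):
    """Rebuild the positionAjax.json URL used in main() for a given city."""
    query = urlencode({"px": "default", "city": city,
                       "needAddtionalResult": "false"})
    return BASE + "positionAjax.json?" + query


def build_referer(keyword, city):
    """Rebuild the Referer header value: the listing page for keyword and city."""
    query = urlencode({"px": "default", "city": city})
    return BASE + "list_" + quote(keyword) + "?" + query


def build_form_data(keyword):
    """Form fields for the first results page, mirroring the script's data dict."""
    return {"first": "true", "kd": keyword, "pn": "1"}


print(build_ajax_url("广州"))
print(build_referer("数据分析", "广州"))
print(build_form_data("数据分析"))
```

Called with 广州 and 数据分析, the first two helpers reproduce the URL and Referer strings hard-coded in the script, which makes it easier to see what would change for another search.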
Running main() above gives the following result: