Python Crawler: Collecting Meituan Merchant Information
1. Background
For work, I need to collect organization information for Meituan merchants — name, phone number, address, and so on — and save it in Excel format.
2. Page Analysis
Following the experience shared at https://blog.****.net/xing851483876/article/details/81842329, I use the mobile version of Meituan directly.
On the desktop site, getting a merchant's phone number takes one extra mouse click; the mobile site skips that step, so it is a bit simpler.
City list URL: http://i.meituan.com/index/changecity?cevent=imt%2Fhd%2FcityBottom
(only needs to be fetched once to get all the cities)
Merchant list URL example: http://i.meituan.com/s/wanzhou-教育?p=2
Three key parts: 1) the city's pinyin abbreviation (wanzhou here); 2) the keyword, 教育 (must be URL-encoded); 3) the page number.
Detail page URL example: http://i.meituan.com/poi/179243134
The merchant ID (179243134) is taken from the list page. The list-page links also used to carry a ct_poi parameter such as ct_poi=240684642564654412435083837672355283025_e8694741092540145794_v1070221787272473329__21_a%e6%95%99%e8%82%b2, but it later disappeared, so I skip it for now; if it turns out to be needed, it can be picked back up from the list page.
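The keyword URL-encoding mentioned above can be checked with the standard library — 教育 encodes to the same bytes seen in the cookie value (%e6%95%99%e8%82%b2, just with upper-case hex):

```python
from urllib.parse import quote

# Percent-encode the search keyword for use in the list-page URL.
keyword = '教育'
encoded = quote(keyword)
print(encoded)  # %E6%95%99%E8%82%B2

# requests also performs this encoding automatically when handed a
# unicode URL, so 'http://i.meituan.com/s/wanzhou-教育' works as-is.
url = 'http://i.meituan.com/s/wanzhou-' + encoded
```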
The plan:
1) Get the pinyin abbreviations of all cities from the city list page;
2) Build list-page URLs from the city pinyin and collect the merchant links on each page;
3) Loop over the merchant links and extract the required fields.
3. Writing the Crawler
3.1 Fetching the city list
Using the tool at https://blog.****.net/weixin_43420032/article/details/84646041, the captured request converts straight into Python, headers and cookies included.
The function below fetches the page once and returns a dict mapping city name to pinyin abbreviation, so later steps can look up a city by name and keyword:
import requests
from bs4 import BeautifulSoup as bs

def get_cities_wap():  # fetch city names and pinyin abbreviations from the mobile site
    cookies = {
        '__mta': '209614381.1543978383220.1543978491990.1543978501965.5',
        '_lxsdk_cuid': '16666fc2e54c8-06bb633ea17d43-737356c-15f900-16666fc2e54c8',
        'oc': 'Ze9dLOWSIlgu7r7EbFMStrH7FxUq57MiiNsP2vGkntNcdKo_CV5R2rHC7W9jVd9dPbO4UY_R3GRmoZhCH62HUnibfEBt7ArKLhxtVp_F4MBIfn1mLfucCPiTqWKtLPjSb65K76r1y49Ol1tEWBAqjvuF08yuJ39OBE8LEAk1wYM',
        'uuid': '0089ef8aea0b44b28a39.1543568012.1.0.0',
        '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
        'JSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'IJSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'iuuid': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        '_lxsdk': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        'webp': '1',
        '__utmc': '74597006',
        '__utmz': '74597006.1543917113.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
        '_hc.v': 'e06c0122-49b5-981c-ec71-64d1b97be3c1.1543917118',
        'ci': '174',
        'rvct': '174%2C957%2C517%2C1',
        'cityname': '%E4%B8%83%E5%8F%B0%E6%B2%B3',
        '__utma': '74597006.1820410032.1543917113.1543917113.1543978384.2',
        'ci3': '1',
        'idau': '1',
        'i_extend': 'H__a100001__b2',
        'latlng': '39.90569,116.22299,1543978425426',
        '__utmb': '74597006.19.9.1543978504738',
        '_lxsdk_s': '1677c4847d9-0cf-fe9-ad0%7C%7C24',
    }
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Referer': 'http://i.meituan.com/index/changecity?cevent=imt%2Fhd%2FcityBottom',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    params = (
        ('cevent', 'imt/hd/cityBottom'),
    )
    response = requests.get('http://i.meituan.com/index/changecity', headers=headers, params=params, cookies=cookies)
    result = str(response.content, 'utf-8')
    soup = bs(result, 'html.parser')  # parse with BeautifulSoup
    # s1 = soup.find_all(name='a', attrs={'class': 'react'})  # all city nodes
    # https://www.cnblogs.com/cymwill/articles/7574479.html
    s1 = soup.find_all(lambda tag: tag.has_attr('class') and tag.has_attr('data-citypinyin'))
    dics = {}  # dict mapping city name -> pinyin abbreviation
    for i in s1:
        city = i.text
        jianxie = i['data-citypinyin']
        dics[city] = jianxie
    return dics
PS: s1 = soup.find_all(lambda tag: tag.has_attr('class') and tag.has_attr('data-citypinyin')) [1]
To be honest, I can't claim to fully understand everything going on in there!
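The lambda filter above can be sanity-checked offline against a hand-written snippet that mimics the city-list markup (the HTML below is illustrative, not captured from the real page):

```python
from bs4 import BeautifulSoup as bs

# Stand-in for the city list: each city tag carries both a class
# attribute and a data-citypinyin attribute; other tags do not match.
html = '''
<a class="react" data-citypinyin="beijing">北京</a>
<a class="react" data-citypinyin="wanzhou">万州</a>
<span class="other">not a city</span>
'''
soup = bs(html, 'html.parser')
s1 = soup.find_all(lambda tag: tag.has_attr('class') and tag.has_attr('data-citypinyin'))
dics = {i.text: i['data-citypinyin'] for i in s1}
print(dics)  # {'北京': 'beijing', '万州': 'wanzhou'}
```

The `<span>` has a class but no data-citypinyin, so the lambda correctly excludes it.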
3.2 Building list-page URLs
With the city and keyword in hand, the only missing piece is the number of pages.
Once the page count is known, a for loop can fetch every page's data.
To be safe (read: lazy), I again used the tool above to generate the Python request code:
def getOrg(city, sw, page):  # fetch merchant links from one list page
    cookies = {
        '__mta': '209614381.1543978383220.1543997852031.1543998361599.17',
        '_lxsdk_cuid': '16666fc2e54c8-06bb633ea17d43-737356c-15f900-16666fc2e54c8',
        'oc': 'Ze9dLOWSIlgu7r7EbFMStrH7FxUq57MiiNsP2vGkntNcdKo_CV5R2rHC7W9jVd9dPbO4UY_R3GRmoZhCH62HUnibfEBt7ArKLhxtVp_F4MBIfn1mLfucCPiTqWKtLPjSb65K76r1y49Ol1tEWBAqjvuF08yuJ39OBE8LEAk1wYM',
        'uuid': '0089ef8aea0b44b28a39.1543568012.1.0.0',
        '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
        'JSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'IJSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'iuuid': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        '_lxsdk': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        'webp': '1',
        '__utmc': '74597006',
        '__utmz': '74597006.1543917113.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
        '_hc.v': 'e06c0122-49b5-981c-ec71-64d1b97be3c1.1543917118',
        'rvct': '174%2C957%2C517%2C1',
        'ci3': '1',
        'a2h': '3',
        'idau': '1',
        '__utma': '74597006.1820410032.1543917113.1543978512.1543997779.4',
        'i_extend': 'C_b3E240684642564654412435083837672355283025_e8694741092540145794_v1070221787272473329_a%e6%95%99%e8%82%b2GimthomepagesearchH__a100005__b4',
        'ci': '1',
        'cityname': '%E5%8C%97%E4%BA%AC',
        '__utmb': '74597006.4.9.1543997782892',
        'latlng': '39.90569,116.22299,1543998363681',
        '_lxsdk_s': '1677d706961-ed1-ece-8b8%7C%7C5',
    }
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    params = (
        ('p', page),
    )
    url = 'http://i.meituan.com/s/{}-{}'.format(city, sw)
    response = requests.get(url, headers=headers, params=params, cookies=cookies)
    result = str(response.content, 'utf-8')
    soup = bs(result, 'html.parser')  # parse with BeautifulSoup
    jigous = soup.find_all(name='dd', attrs={'class': 'poi-list-item'})
    arrs = []  # list of [shop link, ctpoi] pairs
    for jigou in jigous:
        # jigou.find(name='span', attrs={'class': 'poiname'}).text  # merchant name on the list page
        href = 'http:' + jigou.findChild('a').attrs['href']
        ctpoi = jigou.findChild('a').attrs['data-ctpoi']
        arrs.append([href, ctpoi])
    return arrs
PS: ctpoi = jigou.findChild('a').attrs['data-ctpoi'] [2]
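As with the city list, the dd.poi-list-item extraction can be exercised offline on a mock snippet (the markup below is an assumption reconstructed from the selectors used, not captured from the site):

```python
from bs4 import BeautifulSoup as bs

# Mock of one list-page entry: a <dd class="poi-list-item"> wrapping an
# <a> that carries a protocol-relative link and a data-ctpoi attribute.
html = '''
<dl>
  <dd class="poi-list-item">
    <a href="//i.meituan.com/poi/179243134" data-ctpoi="240684_demo">
      <span class="poiname">某教育机构</span>
    </a>
  </dd>
</dl>
'''
soup = bs(html, 'html.parser')
arrs = []
for jigou in soup.find_all(name='dd', attrs={'class': 'poi-list-item'}):
    a = jigou.findChild('a')
    arrs.append(['http:' + a.attrs['href'], a.attrs['data-ctpoi']])
print(arrs)  # [['http://i.meituan.com/poi/179243134', '240684_demo']]
```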
3.3 Maximum page count
To be done.
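Until the paging element is worked out, one fallback is to keep requesting pages until a page returns no poi-list-item entries. A sketch of that stopping logic, with a stub standing in for getOrg so it can run without the network:

```python
def crawl_all_pages(fetch_page, max_pages=100):
    """Collect results page by page until an empty page is returned.

    fetch_page(page) stands in for getOrg(city, sw, page); max_pages is
    a safety cap so a parsing failure cannot loop forever.
    """
    results = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:  # an empty list means we ran past the last page
            break
        results.extend(items)
    return results

# Stub: pretend the site has three pages of merchants.
fake_pages = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e']}
got = crawl_all_pages(lambda p: fake_pages.get(p, []))
print(got)  # ['a', 'b', 'c', 'd', 'e']
```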
3.4 Building detail-page URLs and extracting data
To be done.
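The merchant ID needed for detail pages can already be pulled out of the links that getOrg returns; a small helper for that step (the function name is my own, not from the post):

```python
import re

def poi_id(href):
    """Extract the numeric merchant ID from a detail-page link,
    e.g. 'http://i.meituan.com/poi/179243134' -> '179243134'."""
    m = re.search(r'/poi/(\d+)', href)
    return m.group(1) if m else None

print(poi_id('http://i.meituan.com/poi/179243134'))  # 179243134
```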
4. Data Storage
To be done; the plan is to use openpyxl.
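A minimal sketch of the planned openpyxl export, assuming each record is a [name, phone, address] row (the column layout is my assumption, not settled in the post):

```python
from openpyxl import Workbook

def save_rows(rows, path='meituan.xlsx'):
    """Write a header plus data rows to an Excel file with openpyxl."""
    wb = Workbook()
    ws = wb.active
    ws.title = '商家信息'
    ws.append(['名称', '电话', '地址'])  # header row
    for row in rows:
        ws.append(row)
    wb.save(path)

save_rows([['某教育机构', '010-12345678', '北京市某区某路1号']], 'demo.xlsx')
```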
[1] https://www.cnblogs.com/cymwill/articles/7574479.html — finding nodes that contain a given attribute, used here to get the city list.
[2] https://zhidao.baidu.com/question/366981561612724444.html — reading an attribute's value from a matched node.