Python crawler: collecting Meituan merchant information

I. Background

For work I need to collect information about Meituan merchants, including name, phone number, and address, and save it in Excel format.

II. Page analysis

Following the approach in https://blog.csdn.net/xing851483876/article/details/81842329, I go straight to the mobile (WAP) version of Meituan.
On the desktop site you need an extra click to reveal the phone number; the mobile version skips that step, so it is a bit simpler.


City list URL: http://i.meituan.com/index/changecity?cevent=imt%2Fhd%2FcityBottom
(only needs to be fetched once to get all cities)


Merchant list URL example: http://i.meituan.com/s/wanzhou-教育?p=2
Three key parts: 1) the city's pinyin abbreviation (wanzhou here); 2) the keyword 教育, which has to be percent-encoded in the URL; 3) the page number p.
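The keyword part is just the UTF-8 percent-encoding of the Chinese text. A minimal sketch of building such a list-page URL by hand with Python's standard urllib (requests will do this encoding for us automatically later, so this only shows what the browser actually sends):

from urllib.parse import quote

city = 'wanzhou'     # city pinyin abbreviation
keyword = '教育'      # search keyword
page = 2
list_url = 'http://i.meituan.com/s/{}-{}?p={}'.format(city, quote(keyword), page)
print(list_url)      # http://i.meituan.com/s/wanzhou-%E6%95%99%E8%82%B2?p=2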


Detail page URL example: http://i.meituan.com/poi/179243134
The number 179243134 is the merchant ID taken from the list page.
ct_poi: 240684642564654412435083837672355283025_e8694741092540145794_v1070221787272473329__21_a%e6%95%99%e8%82%b2
The ct_poi parameter used to be present but later disappeared, so it is not needed for now; if it turns out to be required, it can be picked up from the list page again.


The plan is as follows (a rough wiring sketch is given right below the list):
1) get the pinyin abbreviation of every city from the city list;
2) build the list-page URL from the city pinyin and pull the merchant links from it;
3) loop over the merchant links and extract the fields we need.
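A sketch of how the three steps might be wired together. This is only an outline: get_cities_wap and getOrg are defined later in this post, while fetch_detail (step 3, returning name/phone/address) is still unwritten and appears here as a hypothetical placeholder.

def crawl(city_name, keyword, max_page=10):
    cities = get_cities_wap()                # step 1: city name -> pinyin abbreviation
    city_pinyin = cities[city_name]          # e.g. '万州' -> 'wanzhou'
    results = []
    for page in range(1, max_page + 1):      # step 2: walk the list pages
        shops = getOrg(city_pinyin, keyword, page)
        if not shops:                        # assumption: an out-of-range page returns no items
            break
        for href, ctpoi in shops:            # step 3: visit each merchant's detail page
            results.append(fetch_detail(href))   # fetch_detail is not written yet
    return results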

III. Writing the crawler

1. Getting the city list

Using the tool described at https://blog.csdn.net/weixin_43420032/article/details/84646041, the request copied from the browser is converted straight into Python, with the headers and cookies already filled in.
This gives a dictionary of city names and their pinyin abbreviations, so later we can look things up from the city name and keyword the user provides:

import requests
from bs4 import BeautifulSoup as bs

def get_cities_wap():   # fetch the city list (name -> pinyin abbreviation) from the WAP site
    cookies = {
        '__mta': '209614381.1543978383220.1543978491990.1543978501965.5',
        '_lxsdk_cuid': '16666fc2e54c8-06bb633ea17d43-737356c-15f900-16666fc2e54c8',
        'oc': 'Ze9dLOWSIlgu7r7EbFMStrH7FxUq57MiiNsP2vGkntNcdKo_CV5R2rHC7W9jVd9dPbO4UY_R3GRmoZhCH62HUnibfEBt7ArKLhxtVp_F4MBIfn1mLfucCPiTqWKtLPjSb65K76r1y49Ol1tEWBAqjvuF08yuJ39OBE8LEAk1wYM',
        'uuid': '0089ef8aea0b44b28a39.1543568012.1.0.0',
        '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
        'JSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'IJSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'iuuid': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        '_lxsdk': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        'webp': '1',
        '__utmc': '74597006',
        '__utmz': '74597006.1543917113.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
        '_hc.v': 'e06c0122-49b5-981c-ec71-64d1b97be3c1.1543917118',
        'ci': '174',
        'rvct': '174%2C957%2C517%2C1',
        'cityname': '%E4%B8%83%E5%8F%B0%E6%B2%B3',
        '__utma': '74597006.1820410032.1543917113.1543917113.1543978384.2',
        'ci3': '1',
        'idau': '1',
        'i_extend': 'H__a100001__b2',
        'latlng': '39.90569,116.22299,1543978425426',
        '__utmb': '74597006.19.9.1543978504738',
        '_lxsdk_s': '1677c4847d9-0cf-fe9-ad0%7C%7C24',
    }
    
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Referer': 'http://i.meituan.com/index/changecity?cevent=imt%2Fhd%2FcityBottom',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    
    params = (
        ('cevent', 'imt/hd/cityBottom'),
    )
    
    response = requests.get('http://i.meituan.com/index/changecity', headers=headers, params=params, cookies=cookies)
    result = str(response.content,'utf-8')
    soup = bs(result,'html.parser')   # parse with BeautifulSoup
    #s1 = soup.find_all(name='a',attrs={'class':'react'})    # get all city nodes
    #https://www.cnblogs.com/cymwill/articles/7574479.html
    s1 = soup.find_all(lambda tag:tag.has_attr('class') and tag.has_attr('data-citypinyin'))
    dics = {}   # dict of city name -> pinyin abbreviation
    for i in s1:
        city = i.text
        jianxie = i['data-citypinyin']
        dic = {city:jianxie}
        dics.update(dic)
    return dics
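A quick usage check (the exact contents depend on what the page returns when you run it):

cities = get_cities_wap()
print(len(cities))           # number of cities scraped
print(cities.get('北京'))    # expected to be something like 'beijing'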

P.S. s1 = soup.find_all(lambda tag:tag.has_attr('class') and tag.has_attr('data-citypinyin')) 1
To be honest, I don't fully understand everything going on in that line!
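For what it's worth, that lambda simply keeps every tag that has both a class attribute and a data-citypinyin attribute. A roughly equivalent CSS-selector version (my own, untested alternative) would be:

s1 = soup.select('[class][data-citypinyin]')   # tags carrying both attributes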

2. Building the list-page request

With the city and keyword in hand, the only thing still missing is the number of pages.
Once we know the page count, a for loop can fetch the data from every page.
To be safe (read: because I'm lazy), I again used the tool above to generate the Python request code:

def getOrg(city,sw,page):
    cookies = {
        '__mta': '209614381.1543978383220.1543997852031.1543998361599.17',
        '_lxsdk_cuid': '16666fc2e54c8-06bb633ea17d43-737356c-15f900-16666fc2e54c8',
        'oc': 'Ze9dLOWSIlgu7r7EbFMStrH7FxUq57MiiNsP2vGkntNcdKo_CV5R2rHC7W9jVd9dPbO4UY_R3GRmoZhCH62HUnibfEBt7ArKLhxtVp_F4MBIfn1mLfucCPiTqWKtLPjSb65K76r1y49Ol1tEWBAqjvuF08yuJ39OBE8LEAk1wYM',
        'uuid': '0089ef8aea0b44b28a39.1543568012.1.0.0',
        '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
        'JSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'IJSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'iuuid': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        '_lxsdk': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        'webp': '1',
        '__utmc': '74597006',
        '__utmz': '74597006.1543917113.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
        '_hc.v': 'e06c0122-49b5-981c-ec71-64d1b97be3c1.1543917118',
        'rvct': '174%2C957%2C517%2C1',
        'ci3': '1',
        'a2h': '3',
        'idau': '1',
        '__utma': '74597006.1820410032.1543917113.1543978512.1543997779.4',
        'i_extend': 'C_b3E240684642564654412435083837672355283025_e8694741092540145794_v1070221787272473329_a%e6%95%99%e8%82%b2GimthomepagesearchH__a100005__b4',
        'ci': '1',
        'cityname': '%E5%8C%97%E4%BA%AC',
        '__utmb': '74597006.4.9.1543997782892',
        'latlng': '39.90569,116.22299,1543998363681',
        '_lxsdk_s': '1677d706961-ed1-ece-8b8%7C%7C5',
    }
    
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    
    params = (
        ('p', page),
    )
    
    url = 'http://i.meituan.com/s/{}-{}'.format(city,sw)
    
    response = requests.get(url, headers=headers, params=params, cookies=cookies)
    result = str(response.content,'utf-8')
    soup = bs(result,'html.parser')   # parse with BeautifulSoup
    
    jigous = soup.find_all(name='dd',attrs={'class':'poi-list-item'})
    arrs = []   # list of [detail-page link, ctpoi] pairs
    for jigou in jigous:
        #jigou.find(name='span',attrs={'class':'poiname'}).text    # merchant name on the list page
        href = 'http:'+jigou.findChild('a').attrs['href']
        ctpoi = jigou.findChild('a').attrs['data-ctpoi']
        arr = [href,ctpoi]
        arrs.append(arr)
    return arrs

P.S. ctpoi = jigou.findChild('a').attrs['data-ctpoi'] 2
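A small usage example, assuming the list page still serves dd.poi-list-item nodes (requests percent-encodes the Chinese keyword in the URL for us):

shops = getOrg('wanzhou', '教育', 1)
for href, ctpoi in shops:
    print(href)     # e.g. http://i.meituan.com/poi/179243134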

3. Maximum page count

To do.

4. Building detail-page URLs and extracting the data

To do.

IV. Data storage

To do; the plan is to use openpyxl.
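As a placeholder, a minimal sketch of what the openpyxl export might look like, assuming each merchant ends up as a (name, phone, address) tuple; the function name and columns here are my own placeholders, not final code:

from openpyxl import Workbook

def save_to_excel(rows, path='meituan.xlsx'):
    wb = Workbook()
    ws = wb.active
    ws.append(['名称', '电话', '地址'])      # header row: name, phone, address
    for name, phone, address in rows:
        ws.append([name, phone, address])
    wb.save(path)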


  1. https://www.cnblogs.com/cymwill/articles/7574479.html: how to find nodes that carry a given attribute, used here to get the city list. ↩︎

  2. https://zhidao.baidu.com/question/366981561612724444.html: how to read an attribute's value from a tag. ↩︎