Python Regular Expressions

A record of scraping course titles from ichunqiu.

First, check whether the site has any anti-scraping measures.

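import requests

url = 'https://www.ichunqiu.com/courses'

def getTitle():
    # Plain request with no custom headers, to see how the site responds
    req = requests.get(url=url)
    html = req.content
    print(html)

if __name__ == '__main__':
    getTitle()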

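It returned an error page, so there is indeed an anti-scraping check.
Add request headers so the request looks like it comes from a browser: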

    headers = {
        'Host': 'www.ichunqiu.com',
        'Connection': 'close',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }

That gets past the check.
Next, look at the page source and find the block that holds the titles.

Found it: the title is the value of the title attribute, between class="coursename" and onclick.

    titlere = r'class="coursename" title="(.*?)" onclick'
    title = re.findall(titlere, html)
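To see what this pattern captures, here is a minimal offline sketch. The HTML fragment is a made-up stand-in (the real ichunqiu markup may differ); only the pattern itself comes from the code above.

```python
import re

# Hypothetical HTML fragment; the real page markup may differ.
sample_html = (
    '<p class="coursename" title="Web Security Basics" onclick="go(1)">...</p>\n'
    '<p class="coursename" title="Intro to SQL Injection" onclick="go(2)">...</p>'
)

titlere = r'class="coursename" title="(.*?)" onclick'

# With one capture group, findall returns just the captured text per match.
titles = re.findall(titlere, sample_html)
print(titles)

# For contrast: a greedy (.*) runs to the LAST '" onclick' on the line,
# so two courses on one line collapse into a single oversized match.
one_line = sample_html.replace('\n', '')
greedy = re.findall(r'class="coursename" title="(.*)" onclick', one_line)
print(len(greedy))
```

The non-greedy (.*?) stops at the first '" onclick' after each title, which is why each course comes back as its own entry.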

This raised an error:

Traceback (most recent call last):
  File "1.py", line 24, in <module>
    getTitle()
  File "1.py", line 20, in getTitle
    title = re.findall(titlere, html)
  File "C:\Python36\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object

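A quick search online turned up the cause: req.content is bytes, while the pattern is a str, and re will not mix the two.
The fix is to decode the body first by adding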

    html = html.decode('utf-8')  # python3
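The mismatch can be reproduced without touching the network. The sketch below uses a hand-written bytes value as a stand-in for req.content (the markup is hypothetical) and shows two ways around the error: decoding first, as done here, or matching with a bytes pattern instead.

```python
import re

# Stand-in for req.content: the raw response body as bytes.
# The markup is a hypothetical example, not the real page.
raw = '<p class="coursename" title="渗透测试入门" onclick="go(1)">'.encode('utf-8')

str_pattern = r'class="coursename" title="(.*?)" onclick'

# A str pattern on bytes input reproduces the TypeError from the traceback.
try:
    re.findall(str_pattern, raw)
    raised = False
except TypeError:
    raised = True

# Fix 1: decode the bytes first, then match with the str pattern.
decoded_titles = re.findall(str_pattern, raw.decode('utf-8'))

# Fix 2: keep the bytes and use a bytes pattern; results come back as bytes.
bytes_titles = re.findall(str_pattern.encode('utf-8'), raw)

print(raised, decoded_titles, bytes_titles)
```

Decoding first is usually more convenient, since the titles come back as ordinary strings. requests also offers req.text, which decodes the body for you, though the encoding it picks is inferred and may not always be UTF-8.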

It works.

The full code:

import re
import requests

def getTitle():
    url = 'https://www.ichunqiu.com/courses'
    headers = {
        'Host': 'www.ichunqiu.com',
        'Connection': 'close',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    req = requests.get(url=url, headers=headers)
    html = req.content
    titlere = r'class="coursename" title="(.*?)" onclick'
    html = html.decode('utf-8')  # python3
    title = re.findall(titlere, html)
    for titles in title:
        print(titles)

if __name__ == '__main__':
    getTitle()
