Python regular expressions
Notes from scraping the course titles on ichunqiu
First, check whether the site has any anti-scraping mechanism
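import requests

def getTitle():
    url = 'https://www.ichunqiu.com/courses'
    req = requests.get(url=url)  # no custom headers yet
    html = req.content           # raw response body (bytes)
    print(html)

if __name__ == '__main__':
    getTitle()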
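It returned an error page, so there is indeed an anti-scraping mechanism
Add request headers so the request looks like it comes from a browser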
headers = {
    'Host': 'www.ichunqiu.com',
    'Connection': 'close',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}
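With these headers the request gets past the check
Look at the page source and find the block that holds the title
Found it: the title sits between class="coursename" and onclick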
titlere = r'class="coursename" title="(.*?)" onclick'
title = re.findall(titlere, html)
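To see what this pattern captures without touching the network, here is a minimal sketch run on a made-up HTML fragment. The fragment, the course names and the onclick handlers are assumptions for illustration only, not the real ichunqiu markup. It also shows why the group is written as the non-greedy (.*?) rather than (.*).

```python
import re

# Hypothetical fragment imitating the structure described above;
# the real page markup may differ. Both entries are on one line on purpose.
sample_html = (
    '<p class="coursename" title="Web Security Basics" onclick="go(1)">x</p>'
    '<p class="coursename" title="Intro to Reverse Engineering" onclick="go(2)">y</p>'
)

titlere = r'class="coursename" title="(.*?)" onclick'

# With exactly one capture group, findall returns a list of the captured strings.
titles = re.findall(titlere, sample_html)
print(titles)  # ['Web Security Basics', 'Intro to Reverse Engineering']

# A greedy group runs on to the LAST '" onclick' on the line,
# so the two entries collapse into a single oversized match.
greedy_titles = re.findall(r'class="coursename" title="(.*)" onclick', sample_html)
print(len(greedy_titles))  # 1
```

The non-greedy form stops at the first `" onclick` after each title, which is what gives one result per course.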
Running this against the downloaded page raised an error
Traceback (most recent call last):
  File "1.py", line 24, in <module>
    getTitle()
  File "1.py", line 20, in getTitle
    title = re.findall(titlere, html)
  File "C:\Python36\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
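A quick search online gives the fix: req.content is bytes, while the pattern is a str, so the body has to be decoded before matching
Add this line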
html = html.decode('utf-8') # python3
It works
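The error comes down to mixing a str pattern with bytes input. The sketch below reproduces it offline and shows two ways around it. The raw value is a made-up stand-in for req.content, not real page data.

```python
import re

titlere = r'class="coursename" title="(.*?)" onclick'

# Hypothetical response body: bytes, like what req.content returns.
raw = '<p class="coursename" title="渗透测试入门" onclick="go(1)">'.encode('utf-8')

# A str pattern applied to bytes raises TypeError, as in the traceback.
try:
    re.findall(titlere, raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False

# Option 1: decode the bytes first, then match with the str pattern.
decoded_titles = re.findall(titlere, raw.decode('utf-8'))
print(decoded_titles)  # ['渗透测试入门']

# Option 2: keep the bytes and use a bytes pattern; the matches are bytes too.
bytes_titles = re.findall(titlere.encode('utf-8'), raw)
print([t.decode('utf-8') for t in bytes_titles])  # ['渗透测试入门']
```

requests also exposes req.text, which decodes the body for you using the encoding it infers. That guess can be wrong when the server does not declare a charset, so decoding req.content explicitly, as done here, keeps the choice of encoding visible.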
The full code:
import re
import requests

def getTitle():
    url = 'https://www.ichunqiu.com/courses'
    # Browser-like headers; without them the site returns an error page
    headers = {
        'Host': 'www.ichunqiu.com',
        'Connection': 'close',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    req = requests.get(url=url, headers=headers)
    html = req.content  # bytes
    titlere = r'class="coursename" title="(.*?)" onclick'
    html = html.decode('utf-8')  # python3: decode bytes to str before matching
    title = re.findall(titlere, html)
    for titles in title:
        print(titles)

if __name__ == '__main__':
    getTitle()