Scraping Douban Short Comments for the Film Youth (芳华)
Rather than logging in to Douban programmatically, we copy the request headers straight from the browser and use them to mimic a real request; the Cookie already present in the headers is enough. The main goal here is to practice parsing HTML pages with BeautifulSoup.
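As a minimal sketch of this header-copying approach (the Cookie and User-Agent values below are placeholders, not real ones; paste your own from the browser's DevTools Network tab), a `requests` session can carry the copied headers on every request it makes:

```python
import requests

# Placeholder values -- replace with the headers copied from your own
# logged-in browser session (DevTools -> Network -> Request Headers).
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (placeholder)',
    'Cookie': 'bid=placeholder; dbcl2="placeholder"',
}

session = requests.session()
# merged into the default headers, so every session.get() carries them
session.headers.update(browser_headers)
# resp = session.get('https://movie.douban.com/subject/26862829/comments')
```

Using a session (instead of bare `requests.get`) also reuses the TCP connection across pages and keeps any cookies the server sets along the way.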
Let's inspect the Elements panel of the short-comments page; below is an excerpt covering a single user's entry:
<div class="avatar">
<a title="影志" href="https://www.douban.com/people/tjz230/">
<img src="https://img1.doubanio.com/icon/u1005928-127.jpg" class="">
</a>
</div>
<div class="comment">
<h3>
<span class="comment-vote">
<span class="votes">22920</span>
<input value="1241835988" type="hidden">
<a href="javascript:;" class="j a_vote_comment" onclick="">有用</a>
</span>
<span class="comment-info">
<a href="https://www.douban.com/people/tjz230/" class="">影志</a>
<span>看过</span>
<span class="allstar40 rating" title="推荐"></span>
<span class="comment-time " title="2017-09-11 19:23:52">
2017-09-11
</span>
</span>
</h3>
<p class="">
<span class="short">“没有被善待的人,最容易识别善良,也最珍惜善良。” 适合带长辈们看,或许多少年后,就没人再拍这样的电影了…后面半小时泪弹太足,我们在最好的年代虚度光阴,他们在最坏的年代洗尽铅华。</span>
</p>
<a class="js-irrelevant irrelevant" href="javascript:;">这条短评跟影片无关</a>
<div class="comment-report" style="visibility: hidden;"><a rel="nofollow" href="javascript:void(0)">举报</a></div></div>
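Before scraping pages in bulk, it helps to verify the selectors against this one excerpt. A small sketch, using a trimmed copy of the snippet above (note that BeautifulSoup treats `class` as multi-valued, so `class_='comment-time'` matches the `"comment-time "` attribute despite its trailing space):

```python
from bs4 import BeautifulSoup

# A trimmed copy of the single-comment excerpt shown above.
snippet = '''
<div class="comment">
  <h3>
    <span class="comment-vote"><span class="votes">22920</span></span>
    <span class="comment-info">
      <a href="https://www.douban.com/people/tjz230/" class="">影志</a>
      <span>看过</span>
      <span class="allstar40 rating" title="推荐"></span>
      <span class="comment-time " title="2017-09-11 19:23:52">2017-09-11</span>
    </span>
  </h3>
  <p class=""><span class="short">“没有被善待的人,最容易识别善良,也最珍惜善良。”</span></p>
</div>
'''

soup = BeautifulSoup(snippet, 'html.parser')
vote = soup.find('span', class_='votes').get_text()       # the "useful" count
star = soup.find('span', class_='rating')['class'][0]     # e.g. 'allstar40'
rate = soup.find('span', class_='rating')['title']        # e.g. '推荐'
when = soup.find('span', class_='comment-time')['title']  # full timestamp
print(vote, star, rate, when)
```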
Now let's parse out the information we need, step by step:
import requests
from bs4 import BeautifulSoup
import re
import time

# initialize a dict of empty lists, one list per field
result_dict = dict()
df_col = ['title', 'comment', 'star', 'rate', 'time', 'vote']
for col in df_col:
    result_dict[col] = []

# headers copied from the browser; the existing Cookie is reused as-is
head2 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'll="108288"; bid=qoEA-FVRSpY; __utmc=30149280; __utmz=30149280.1540704092.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); dbcl2="148273687:Uv1tgN2EzCo"; _ga=GA1.2.2062603511.1540704092; _gid=GA1.2.777473429.1540704112; ck=mVbA; push_noty_num=0; push_doumail_num=0; __utmv=30149280.14827; __utma=30149280.2062603511.1540704092.1540712214.1540718153.4; __utma=223695111.2062603511.1540704092.1540718160.1540718160.1; __utmb=223695111.0.10.1540718160; __utmc=223695111; __utmz=223695111.1540718160.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/search; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1540718160%2C%22https%3A%2F%2Fwww.douban.com%2Fsearch%3Fsource%3Dsuggest%26q%3D%25E5%2587%25BA.%25E8%25B7%25AF%22%5D; _pk_ses.100001.4cf6=*; _vwo_uuid_v2=D530C908A4E0F86DFDC7DF75A79EA1C5E|c30117114149cf92b5ad1183944ddd83; __utmt=1; ap_v=0,6.0; __utmb=30149280.8.10.1540718153; _pk_id.100001.4cf6=a8a8b8e695681724.1540718160.1.1540719870.1540718160.',
    'Host': 'movie.douban.com',
    'Referer': 'https://www.douban.com/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

session = requests.session()

# only about 24 pages of short comments are accessible in total
for start in range(0, 500, 20):
    url2 = 'https://movie.douban.com/subject/26862829/comments?start=' + str(start) + '&limit=20&sort=new_score&status=P'
    login_page2 = session.get(url2, headers=head2)
    print('status:', login_page2.status_code)
    page2 = login_page2.text
    soup2 = BeautifulSoup(page2, "html.parser")
    # parse nicknames
    titles = soup2.findAll('a', attrs={'href': re.compile(r'https://www\.douban\.com/people/')})
    for a in titles:
        if a.get('title') is not None:
            result_dict['title'].append(a.get('title'))
    # parse comment text
    for c in soup2.findAll('span', attrs={'class': 'short'}):
        result_dict['comment'].append(c.get_text())
    # parse star class and rating text; some users leave no rating,
    # in which case the second <span> is already the comment time
    for info in soup2.findAll('span', attrs={'class': 'comment-info'}):
        second = info.select('span')[1]
        if second.get('class')[0] == 'comment-time':
            result_dict['star'].append('none')
            result_dict['rate'].append('none')
        else:
            result_dict['star'].append(second.get('class')[0])
            result_dict['rate'].append(second.get('title'))
    # parse comment timestamps (don't name this variable `time`,
    # or it would shadow the time module)
    for ts in soup2.findAll('span', attrs={'class': 'comment-time'}):
        result_dict['time'].append(ts.get('title'))
    # parse "useful" vote counts
    for v in soup2.findAll('span', attrs={'class': 'votes'}):
        result_dict['vote'].append(v.get_text())
    time.sleep(1)  # pause between pages to avoid being blocked

# convert the dict to a DataFrame
import pandas as pd
result_df = pd.DataFrame(result_dict)
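One caveat with this conversion: `pd.DataFrame` requires every list in the dict to have the same length and raises a `ValueError` otherwise, which can happen here if one parsing step matches a different number of elements than the others on some page. A quick sketch with a one-row dict of the kind the loop collects, plus saving to CSV (`utf-8-sig` is an assumption that the file will be opened in Excel, which otherwise mangles Chinese text):

```python
import pandas as pd

# one row of the kind of data the scraping loop collects
sample = {
    'title': ['影志'],
    'comment': ['“没有被善待的人,最容易识别善良,也最珍惜善良。”'],
    'star': ['allstar40'],
    'rate': ['推荐'],
    'time': ['2017-09-11 19:23:52'],
    'vote': ['22920'],
}

# unequal list lengths raise ValueError, so it can pay to check first
assert len({len(v) for v in sample.values()}) == 1

df = pd.DataFrame(sample)
print(df.shape)  # -> (1, 6)

# persist for later analysis
df.to_csv('fanghua_comments.csv', index=False, encoding='utf-8-sig')
```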
The final data looks like this: