A Detailed Walkthrough of Scraping a Novel with Python

The novel to scrape is at http://www.biquw.com/book/29142/

Step 1: fetch the URL and parse the page (with BeautifulSoup)

start_url="http://www.biquw.com/book/29142/"

url=start_url+str(11987333)+'.html'

html=requests.get(url,timeout=15)

soup=BeautifulSoup(html.content,'lxml')  # html.text also works here

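Inspecting the page shows that the chapter text sits inside the tag with id='htmlContent'

and the title inside the tag with class='h1title'.

So we look up every tag with id='htmlContent' or class='h1title' and take its content: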

title=soup.find_all('div',class_='h1title')

content=soup.find_all('div',id='htmlContent')

s=''.join('%s' %id for id in content)  # why this rather than s=''.join(content)?

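I looked it up: str.join only accepts strings, so a list that contains anything else cannot be joined directly. The article I found talks about lists of numbers; here the items in content are BeautifulSoup Tag objects, so each one has to be converted to a string first.

The reference on join is at https://blog.****.net/laochu250/article/details/67649210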
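To make the join point concrete, here is a minimal sketch. FakeTag is a hypothetical stand-in for the objects find_all returns (it is not part of bs4), and join_as_text is just an illustrative helper name.

```python
# Hypothetical stand-in for a bs4 Tag: any object that is not a str.
class FakeTag:
    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


def join_as_text(items):
    """Convert every item with str() first, then concatenate."""
    return ''.join(str(item) for item in items)


items = [FakeTag('<div>chapter text</div>'), 42]

try:
    ''.join(items)            # str.join refuses non-string items
except TypeError as exc:
    print('join failed:', exc)

print(join_as_text(items))    # <div>chapter text</div>42
```

'%s' % item and str(item) give the same result for objects like these; str() is simply the more direct spelling.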

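After running it I found there were still '<br/>' tags in the text. My first idea was to get rid of them with replace,

so I ran

s.replace('<br/>','')

print(s)

But that is not what happened: the <br/> tags were still there.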

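Then I realized I had been really dumb.

In Python, strings are immutable objects. replace does not change the string in place; it creates and returns a new one. The result of replace has to be bound back to a name, for example s=s.replace('<br/>','').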
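A tiny sketch of the difference, using a made-up sample string rather than the real page:

```python
s = 'line one<br/>line two'

s.replace('<br/>', '')        # the new string is thrown away, s is unchanged
unchanged = s

s = s.replace('<br/>', '')    # rebind the name to the string replace returned

print(unchanged)  # line one<br/>line two
print(s)          # line oneline two
```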

Now those damned <br/> tags are finally gone.

The title still has tags in it as well.

Remove them with replace too:

s=''.join('%s' %id for id in content)

t=''.join('%s' %id for id in title)

s=s.replace('<br/>','')

t=t.replace('<div class="h1title">','')

t=t.replace('</h1>','')

t=t.replace('</div>','')
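As an alternative to chaining replace calls, BeautifulSoup can hand back the text without the tags via get_text(). The sketch below is not what this post uses; the SAMPLE markup is made up to resemble the page (the real structure may differ), extract_chapter is a hypothetical helper name, and it uses the built-in 'html.parser' so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the real page may be structured differently.
SAMPLE = (
    '<div class="h1title"><h1>Chapter 1</h1></div>'
    '<div class="contentbox clear" id="htmlContent">'
    'First line.<br/>Second line.</div>'
)


def extract_chapter(markup):
    """Return (title, body) as plain text, without any tags."""
    soup = BeautifulSoup(markup, 'html.parser')
    title_tag = soup.find('div', class_='h1title')
    body_tag = soup.find('div', id='htmlContent')
    # find() returns None when nothing matches, so guard before get_text()
    title = title_tag.get_text(strip=True) if title_tag else ''
    # '\n' is placed between text fragments, i.e. where the <br/> tags were
    body = body_tag.get_text('\n', strip=True) if body_tag else ''
    return title, body


title, body = extract_chapter(SAMPLE)
print(title)
print(body)
```

The trade-off: get_text() does not care which tags are present, whereas the replace approach has to list every tag string exactly as it appears in the page.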

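Next, use a for loop to go through all the chapters.

URL of the first chapter: http://www.biquw.com/book/29142/11987333.html

URL of the last chapter: http://www.biquw.com/book/29142/11989832.html

Only the last four digits change (7333 to 9832), so let's start looping.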
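A sketch of generating the URLs for that range. The one thing to watch is that range() stops before its end value, so the last chapter needs a + 1. chapter_urls is a hypothetical helper name; the two IDs are the ones given above.

```python
START_URL = 'http://www.biquw.com/book/29142/'
FIRST_ID = 11987333
LAST_ID = 11989832


def chapter_urls(start_url, first_id, last_id):
    """Yield one URL per ID, including last_id itself."""
    # range() excludes its end value, hence the + 1
    for chapter_id in range(first_id, last_id + 1):
        yield start_url + str(chapter_id) + '.html'


urls = list(chapter_urls(START_URL, FIRST_ID, LAST_ID))
print(urls[0])    # http://www.biquw.com/book/29142/11987333.html
print(urls[-1])   # http://www.biquw.com/book/29142/11989832.html
print(len(urls))  # 2500
```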

Looks fine, doesn't it? Then let's open the file we wrote and take a look.

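What? What on earth are all these question marks?

replace again. So what are they? Just spaces, which underneath are \xa0 (non-breaking spaces); at least that is how I understand it. So get rid of them with s=s.replace('\xa0','').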
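For the curious, a small check of what \xa0 is: it is the Unicode character U+00A0 NO-BREAK SPACE (what HTML writes as &nbsp;), which looks like a space but is a different character from ' '. clean_spaces is a hypothetical helper; unlike the line above, it swaps in an ordinary space instead of deleting the character, which is a choice you may or may not want.

```python
import unicodedata

NBSP = '\xa0'


def clean_spaces(text):
    """Replace each non-breaking space with an ordinary space."""
    return text.replace(NBSP, ' ')


print(unicodedata.name(NBSP))        # NO-BREAK SPACE
print(NBSP == ' ')                   # False
print(clean_spaces('a\xa0b'))        # a b
```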

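Solved. Everything seemed to be done, but hmm..... there is one more problem: inefficiency. Yes, inefficiency, in for i in range(11987333,11989832):

By this pattern, the second chapter's number should be 11987334, but it is actually 11987335, so the loop also requests numbers that are not chapters.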

Chapter 2 is 2 higher than chapter 1. And chapter 3? It is 3 higher than chapter 2.

http://www.biquw.com/book/29142/11987338.html

But never mind, I'm tired anyway.
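One way around guessing the numbers would be to read the chapter links off the book's index page instead. I have not checked the real index page, so treat this as a sketch only: SAMPLE_INDEX is a made-up fragment, the assumption that chapter links look like href="<digits>.html" is mine, and chapter_links is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin

START_URL = 'http://www.biquw.com/book/29142/'

# Made-up fragment of an index page; the real markup is an assumption.
SAMPLE_INDEX = '''
<a href="11987333.html">Chapter 1</a>
<a href="11987335.html">Chapter 2</a>
<a href="11987338.html">Chapter 3</a>
'''


def chapter_links(index_html, base_url):
    """Return absolute chapter URLs in page order, without duplicates."""
    seen = set()
    links = []
    for href in re.findall(r'href="(\d+\.html)"', index_html):
        if href not in seen:
            seen.add(href)
            links.append(urljoin(base_url, href))
    return links


for link in chapter_links(SAMPLE_INDEX, START_URL):
    print(link)
```

With a list like this, the loop would make one request per real chapter rather than one per number in the range.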

Finally, the full source code:

import requests
from bs4 import BeautifulSoup

start_url="http://www.biquw.com/book/29142/"

# range() excludes its end value, so add 1 to include the last chapter
for i in range(11987333,11989832+1):
    url=start_url+str(i)+'.html'
    html=requests.get(url,timeout=15)
    soup=BeautifulSoup(html.content,'lxml')
    title=soup.find_all('div',class_='h1title')
    content=soup.find_all('div',id='htmlContent')
    # find_all returns Tag objects; convert each to a string before joining
    s=''.join('%s' %tag for tag in content)
    t=''.join('%s' %tag for tag in title)
    # strip the leftover markup and the non-breaking spaces
    s=s.replace('<br/>','')
    s=s.replace('<div class="contentbox clear" id="htmlContent">','')
    s=s.replace('\xa0','')
    t=t.replace('<div class="h1title">','')
    t=t.replace('</h1>','')
    t=t.replace('</div>','')
    print(t)
    # the with block closes the file on exit, so no f.close() is needed
    with open("召唤千军.txt",'a') as f:
        f.write(t)
        f.write(s)

Experts, please leave your feedback; I'm just a newbie. This took me ages, and I feel like my intelligence got steamrolled.
