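My First Web Crawler: The Blog Post I Had to Write

Goal: scrape the complete text of one novel from a novel website.

Tools: Python 3.5, PyCharm

Time spent: 12 hours (much of it spent stuck on small problems)

Result: it worked, of course.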

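# -*- coding: utf-8 -*-
import re

import requests

# Download one web page: the novel's full table-of-contents page
url = 'http://www.jingcaiyuedu.com/book/15401/list.html'
# Send an HTTP GET request with requests, as a browser would;
# the server's reply (status, data, etc.) comes back as a Response object
response = requests.get(url)
# Set the encoding used to decode the page
response.encoding = 'utf-8'
# HTML source of the target novel's contents page
html = response.text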
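# Novel title. findall returns a list, so take the first match with [0];
# without it the file name would be the list's repr, e.g. "['...'].txt"
title = re.findall(r'<title>(.*?)</title>', html)[0]
# Create a file to hold the novel; 'w' opens it for writing
fb = open('%s.txt' % title, 'w', encoding='utf-8')
# Could also be written as: with open('%s.txt' % title, 'w', encoding='utf-8') as f:
# The requested link is a web page, i.e. text, so it is read through .text;
# printing it showed garbled characters, hence the explicit encoding above.
# Get each chapter's info (title, url) with a regular expression.
# This first pattern was wrong for this site; the pattern has to follow the real markup:
# dl = re.findall(r'<dl id="list">.*?</dl>', html, re.S)[0]
# It also failed without the flag: by default '.' matches any character except a
# newline, so a block spanning several lines is never matched. re.S makes '.' match
# every character. '.*?' is a non-greedy match, and [0] pulls the single result
# out of the list that findall returns.

dl = re.findall(r'<dl class="panel-body panel-chapterlist">.*?</dl>', html, re.S)[0]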
# print(dl)
# Chapter list: each (.*?) is a capture group, so findall returns (url, title) tuples
chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', dl)
# Loop over the chapters and download each one
for chapter_info in chapter_info_list:
    # chapter_title = chapter_info[1]
    # chapter_url = chapter_info[0]
    # Tuple unpacking is equivalent to the two lines above
    chapter_url, chapter_title = chapter_info
    # Build the complete URL from the site-relative path
    chapter_url = "http://www.jingcaiyuedu.com%s" % chapter_url
    # print(chapter_info)
    # print(chapter_url, chapter_title)
    # Download the chapter; this gives the whole HTML of the chapter page
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    # Extract the chapter content; findall returns a list, so take [0]
    chapter_content = re.findall(r'<div class="panel-body" id="htmlContent">.*?</div>',
                                 chapter_html, re.S)[0]
    # Clean the data
    chapter_content = chapter_content.replace(" ", "")
    chapter_content = chapter_content.replace('&nbsp;', '')
    chapter_content = chapter_content.replace('<br/>', '')
    chapter_content = chapter_content.replace('<br>', '')
    chapter_content = chapter_content.replace('<p>', '')
    # The extracted content still contains the wrapping <div> tags: the pattern has
    # no capture group, so findall returns the whole match. Strip them here. The
    # opening tag has no spaces because the first replace above already removed them.
    chapter_content = chapter_content.replace('<divclass="panel-body"id="htmlContent">', '')
    chapter_content = chapter_content.replace('</div>', '')


    # Persist the data
    fb.write(chapter_title)
    fb.write('\n')
    fb.write(chapter_content)
    fb.write('\n')

    # print(chapter_content)
    # exit()
    print(chapter_url)

# Close the file so that any buffered text is written to disk
fb.close()
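The cleaning step in the script is a chain of replace calls, one per tag. As a variation, here is a minimal sketch that does the same job with two regular expressions and keeps the line breaks instead of deleting them. The helper name clean_chapter and the sample fragment are my own inventions for illustration, not something taken from the site.

```python
import re
from html import unescape


def clean_chapter(fragment):
    """Turn a chapter's HTML fragment into plain text (hypothetical helper)."""
    # Treat <br> variants and paragraph ends as line breaks.
    text = re.sub(r'<br\s*/?>|</p>', '\n', fragment, flags=re.I)
    # Drop every remaining tag, e.g. <p> or the wrapping <div ...>.
    text = re.sub(r'<[^>]+>', '', text)
    # Decode entities such as &nbsp;, then remove the non-breaking spaces.
    text = unescape(text).replace('\xa0', '')
    # Strip each line and skip the empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)


# Made-up fragment in the shape of the chapter markup, for illustration only.
sample = ('<div class="panel-body" id="htmlContent">'
          '&nbsp;&nbsp;First line.<br/><br>Second line.</div>')
print(clean_chapter(sample))
```

Unlike the script, this version leaves ordinary spaces alone, which matters if a novel contains any non-Chinese text.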

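Pitfalls:

1. I first wanted to download from the URL of the novel's own page, but that page does not show the complete table of contents, so I had to use the URL of the page that does.

2. When looping over the chapters, the URL and the title of each chapter have to be extracted from the contents page, and the URL has to be joined into a complete address; otherwise the next request cannot be made.

3. findall returns a list, and a list cannot be cleaned or written to the file directly. Add [0] after each findall call that should yield a single result to pull the string out of the list.

4. Every extracted chapter came with the tags from my own pattern attached. At first I could not tell why, so I simply deleted them during cleaning to keep the output tidy. The cause is that the pattern contains no capture group, so findall returns the whole match, tags included; wrapping the inner .*? in parentheses would return only the text between the tags.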
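Pitfalls 2 and 4 can both be reproduced without touching the network. The sketch below is only an illustration: the page string is made up to resemble the chapter markup, and the chapter path /book/15401/1.html is a hypothetical example, not a path confirmed from the site. urljoin from the standard library is shown as an alternative to string formatting for building the full URL.

```python
import re
from urllib.parse import urljoin

# Made-up snippet in the shape of a chapter page, for illustration only.
page = '<div class="panel-body" id="htmlContent">Chapter text</div>'

# No capture group: findall returns the whole match, tags included.
whole = re.findall(r'<div class="panel-body" id="htmlContent">.*?</div>',
                   page, re.S)
# With a capture group: findall returns only what the parentheses matched.
inner = re.findall(r'<div class="panel-body" id="htmlContent">(.*?)</div>',
                   page, re.S)

print(whole[0])
print(inner[0])

# urljoin resolves a site-relative href against the page it was found on.
base = 'http://www.jingcaiyuedu.com/book/15401/list.html'
full_url = urljoin(base, '/book/15401/1.html')  # hypothetical chapter path
print(full_url)
```

With the capture group in place, the two replace calls that strip the div tags are no longer needed.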

The final text data looked like this:

[Screenshot of the resulting text file]

Takeaway: a crawler leans heavily on regular expressions. I will review them shortly and write up some notes.
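As a start on that review, here is a small sketch of the two regex details that cost me the most time: the re.S flag and greedy versus non-greedy matching. The text string is invented for the demonstration and only loosely imitates a chapter list.

```python
import re

# Invented two-block sample that spans several lines.
text = ('<dl>\n<a href="/a">One</a>\n</dl>\n'
        '<dl>\n<a href="/b">Two</a>\n</dl>')

# Without re.S, '.' stops at newlines, so a multi-line block never matches.
no_flag = re.findall(r'<dl>.*?</dl>', text)
# With re.S, '.' also matches newlines; '.*?' stops at the first </dl>.
non_greedy = re.findall(r'<dl>.*?</dl>', text, re.S)
# Greedy '.*' runs on to the last </dl>, merging both blocks into one match.
greedy = re.findall(r'<dl>.*</dl>', text, re.S)

print(len(no_flag), len(non_greedy), len(greedy))  # 0 2 1
```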
