Hands-on | Python Spiders: Web Crawler Notes

Requirements

1. Batch-crawl listing data from a target website; the walkthrough below uses Ganji rental listings as the example.

Preparation

(1) Scrapy installation:

1. Install Python 3.6 (download from the official site)
2. Install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/   (download the build matching your Python version)
3. Download lxml: https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml    (pick the matching .whl)
4. Double-click the pywin32 installer and click through until the installation completes
5. Go to the folder containing the downloaded files, hold Shift, right-click, and choose "Open command window here"
6. pip install lxml (use Tab to complete the .whl filename)
6.1. pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl   // the version pulled in automatically is not supported on this setup, so install the wheel separately
7. pip install scrapy

(2) Interactive shell: get familiar with the tool commands (a small offline sketch follows this list)

1. scrapy shell http://gz.ganji.com/fang1/     # try crawling this URL to see whether the page can be fetched
2. response          # on success, a response object is returned
3. view(response)    # open the fetched page in a browser
4. response.xpath("//*[@id='puid-3026077037']/dl/dd[5]/div[1]/span[1]").extract()   # XPath copied from the browser (//*[@id="puid-3026077037"]/dl/dd[5]/div[1]/span[1]); note the inner double quotes must become single quotes. This grabs the element.
5. response.xpath("//*[@id='puid-3026077037']/dl/dd[5]/div[1]/span[1]/text()").extract()   # how to grab a single value: append /text()
6. Change the locator so it matches the whole list instead of one element: "//*[@id='puid-3026077037']/dl/dd[5]/div[1]/span[1]" --> "//div[@class='f-list-item ershoufang-list']/dl/dd[5]/div[1]/span[1]/text()"
7. response.xpath("//div[@class='f-list-item ershoufang-list']/dl/dd[5]/div[1]/span[1]/text()").extract()   # matches the price of every item in the list
8. response.xpath("//div[@class='f-list-item ershoufang-list']/dl/dd[1]/a/text()").extract()   # matches the title of every item in the list

9. len()   # check the length of the returned list
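
If you want to try these locators without hitting the live site, the same XPath expressions can be tested against a hand-written HTML fragment with scrapy's Selector. A minimal sketch, assuming a made-up fragment that mimics the Ganji list markup:

from scrapy import Selector

# Hypothetical fragment mimicking one Ganji list item; the real page has many of these.
html = """
<div class="f-list-item ershoufang-list" id="puid-3026077037">
  <dl>
    <dd><a>Sunny two-bedroom near the metro</a></dd>
    <dd></dd><dd></dd><dd></dd>
    <dd><div><span>2300</span></div></dd>
  </dl>
</div>
"""

sel = Selector(text=html)
titles = sel.xpath("//div[@class='f-list-item ershoufang-list']/dl/dd[1]/a/text()").extract()
prices = sel.xpath("//div[@class='f-list-item ershoufang-list']/dl/dd[5]/div[1]/span[1]/text()").extract()
print(titles, prices, len(prices))   # ['Sunny two-bedroom near the metro'] ['2300'] 1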

On to the code

(1) Create the Scrapy project

1. scrapy startproject zufang    // command to create the project
2. Open the PyCharm IDE    // the tool the author uses; very handy

3. Open the zufang directory (see the screenshot below)

[Screenshot: the project directory in PyCharm]

4. Add a zufang_detail.py file under spiders/    // you can name the file whatever you like
5. Write the spider code in zufang_detail.py
6. Run the following commands from PyCharm's built-in Terminal
7. scrapy list    // list the spiders in the project

8. scrapy crawl zufang_detail    // start the crawl

[Screenshot: running the crawl command]

If the code is correct you will see normal output like the screenshot below. If you hit errors, search for them online; there is plenty of material and everyone's errors differ, so error cases are not demonstrated here. Solving them yourself is how you really learn.

[Screenshot: normal crawl output]

(2) Code walkthrough

1. Directory structure:

[Screenshot: project directory structure]

Creating the Scrapy project generates all of the files above except zufang_detail.py. For details, see the manual: http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
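
For reference, the generated layout looks roughly like this. The package name follows the settings.py shown below (zufang_detail); the exact set of generated files depends on your Scrapy version, and zufang_detail.py under spiders/ is the file we add by hand:

zufang_detail/
    scrapy.cfg
    zufang_detail/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            zufang_detail.py    # the spider file added by hand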

2. items.py  // the data field definitions. It is just a class where you declare the attributes you want to capture; for rental listings, for example, you can define title, price, and so on.

# -*- coding: utf-8 -*-
import scrapy
class ZufangDetailItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()      # title
    money = scrapy.Field()      # price
    content = scrapy.Field()    # description
    address = scrapy.Field()    # address
    other = scrapy.Field()      # other info (transport)
    imgurl = scrapy.Field()     # image URL
    url = scrapy.Field()        # listing URL
    phone = scrapy.Field()      # contact phone
    filename = scrapy.Field()   # output file name
    id = scrapy.Field()         # random id used for image file names
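
A quick sketch (my own example, not from the original post) of how these Item objects behave: they work like dicts but only accept the declared fields.

# Hypothetical usage of ZufangDetailItem, just to show the dict-like API.
item = ZufangDetailItem()
item['title'] = 'Sunny two-bedroom'
item['money'] = '2300'
print(dict(item))        # {'title': 'Sunny two-bedroom', 'money': '2300'}
# item['area'] = '80'    # KeyError: ZufangDetailItem does not support field: area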

3. pipelines.py  // the pipeline file. It receives the data for the fields defined in items.py; this is where you write it to a database or export it to an Excel file.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from openpyxl import Workbook

class ZufangDetailPipeline(object):
    def open_spider(self, spider):
        # To save to MySQL, uncomment the code below
        # self.con = pymysql.connect(host="localhost", port=3306, user="root", passwd="123456", db="ganji", charset="utf8")
        # self.cu = self.con.cursor()

        # Save to Excel
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(['标题','价格','简要信息','其他信息','小区信息','联系电话','链接'])  # header row
    def process_item(self, item, spider):
        # Save to the database (uncomment to use)
        # print(spider.name, 'pipelines')
        # insert_sql = "insert into zufang (money,title,content,other,address,phone,url)  values('{}','{}','{}','{}','{}','{}','{}')"\
        #     .format(item['money'],item['title'],item['content'],item['other'],item['address'],item['phone'],item['url'])
        # try:
        #     # execute the SQL statement
        #     print(insert_sql)
        #     self.cu.execute(insert_sql)
        #     # commit the transaction
        #     self.con.commit()
        # except:
        #     # roll back on error
        #     self.con.rollback()
        # return item

        # Save to Excel
        line = [item['title'], item['money'], item['content'], item['other'], item['address'], item['phone'], item['url']]
        self.ws.append(line)  # append the row to the xlsx
        self.wb.save(item['filename'] + '.xlsx')
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider (not spider_close) when the spider finishes.
        # Close the database connection here if you enabled the MySQL code above.
        # self.con.close()
        pass
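
If you do enable the MySQL path, a safer variant of the commented-out insert is to let pymysql do the quoting instead of building the SQL with str.format, which breaks on quotes in titles and is open to SQL injection. A sketch, assuming the same zufang table as the original SQL:

import pymysql

def save_item_mysql(con, item):
    # Parameterized insert: pymysql substitutes the %s placeholders safely.
    sql = ("insert into zufang (money,title,content,other,address,phone,url) "
           "values (%s,%s,%s,%s,%s,%s,%s)")
    with con.cursor() as cur:
        cur.execute(sql, (item['money'], item['title'], item['content'],
                          item['other'], item['address'], item['phone'], item['url']))
    con.commit()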

4. settings.py  // various project settings

The key ones:

BOT_NAME = 'zufang_detail'  # bot/project name, filled in by default when the project is created
ITEM_PIPELINES = {
   'zufang_detail.pipelines.ZufangDetailPipeline': 300,  # dict of the pipelines enabled in the project and their order; empty by default. The value can be anything, but by convention it falls in the 0-1000 range.
}
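
A few other settings are worth considering when crawling a real site (these are my own suggestions, not part of the original project): slow the crawl down and send a browser-like User-Agent so the target is less likely to block you.

# Optional politeness settings (not in the original settings.py)
DOWNLOAD_DELAY = 2          # wait 2 seconds between requests
ROBOTSTXT_OBEY = True       # respect the site's robots.txt
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # browser-like UA string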

5. zufang_detail.py  // the actual spider logic

import scrapy
from ..items import ZufangDetailItem
from scrapy import Selector
import urllib.request   # needed for urllib.request.urlretrieve below
import time
import os
import random
import string

class GanjiSpider(scrapy.Spider):
    name = "zufang_detail"  # spider name; if unsure, leave it as-is (same as the project name)
    start_urls = ['http://gz.ganji.com/fang1/tianhe/']  # the target page to crawl

    def parse(self, response):  # parses the list page

        # detail-page XPath copied from the browser: //*[@id="f_detail"]/div[6]/div[2]/div[1]/div/div[2]/ul/li[1]
        url_list = response.xpath("//div[@class='f-list-item ershoufang-list']/@href").extract()
        # print(url_list)
        # return
        # yield scrapy.Request(response.urljoin(url_list[1]), callback=self.parse_detail, meta={'url': url_list[1]})
        for href in url_list:
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail, meta={'url': href})
        # Pagination: uncomment if you need to follow the next pages
        # next_page_11 = response.xpath("*[@id='f_mew_list']/div[6]/div[1]/div[4]/div/div/ul/li[11]/a/@href").extract()
        # next_page_12 = response.xpath("*[@id='f_mew_list']/div[6]/div[1]/div[4]/div/div/ul/li[12]/a/@href").extract()
        # if next_page_11:          # extract() returns a list, so test for emptiness rather than None
        #     next_page = next_page_11[0]
        # else:
        #     next_page = next_page_12[0]
        # if next_page is not None:
        #     next_page_new = response.urljoin(next_page)
        #     time.sleep(5)
        #     yield scrapy.Request(next_page_new, callback=self.parse)


    # Crawls the detail pages: parse() above collects the list; this follows each entry into its detail page
    def parse_detail(self, response):
        item = ZufangDetailItem()
        item['url'] = response.meta['url']
        item['title'] = ''.join(response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[2]/div[1]/p[1]/i/text()").extract())
        item['money'] = ''.join(response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[2]/div[1]/ul[1]/li[1]/span[2]/text()").extract())

        # housing details
        content_list_htmls = response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[2]/div[1]/ul[2]").extract()
        content_list = []
        for content_list_html in content_list_htmls:
            sel = Selector(text=content_list_html, type="html")
            spans_str = "|".join(sel.xpath('//li/span[2]/text()').extract())
            content_list.append(spans_str.replace("&nbsp", "|"))
        item['content'] = content_list[0]
        # address
        xiaoqu = response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[2]/div[1]/ul[3]/li[1]/span[2]/a/text()").extract()
        address1 = response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[2]/div[1]/ul[3]/li[3]/span[2]/a[1]/text()").extract()
        address2 = response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[2]/div[1]/ul[3]/li[3]/span[2]/a[2]/text()").extract()
        address3 = response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[2]/div[1]/ul[3]/li[3]/span[2]/a[3]/text()").extract()
        if not address3:  # extract() returns a list, so check for emptiness rather than None
            address3 = response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[2]/div[1]/ul[3]/li[2]/span[2]/span/text()").extract()
            # print(content_list)
        item['address'] = ''.join(xiaoqu) + "|" + ''.join(address1) + "-" + ''.join(address2) + "-" + ''.join(address3)
        # transport info
        item['other'] = ''.join(response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[2]/div[1]/ul[3]/li[2]/div/span[1]/text()").extract())
        # contact phone
        item['phone'] = ''.join(response.xpath("//*[@id='full_phone_show']/@data-phone").extract())
        # output file name
        item['filename'] = "天河区租房信息"  # "Tianhe district rental listings"
        item['id'] = ''.join(random.sample(string.ascii_letters + string.digits, 8))

        # image URLs
        img_urls = response.xpath("//*[@id='f_detail']/div[5]/div[2]/div[1]/div/div[2]/ul/li/@data-image").extract()
        self.get_img(img_urls, item)
        yield item
        # print(item)

    def get_img(self, imgurls, item):  # download the listing images
        path = 'D:\\lin\\csrapy\\zufang_detail\\'  # local base directory (adjust to your machine)
        folder = path + item['filename']
        if not os.path.exists(folder):
            os.mkdir(folder)
        os.chdir(folder)
        n = 1
        for img_url in imgurls:
            time.sleep(3)
            if n < 3:  # only keep the first two images
                if not os.path.exists(folder + '\\' + item['id'] + '_%s.jpg' % n):
                    try:
                        urllib.request.urlretrieve(img_url, item['id'] + '_%s.jpg' % n)  # Python 3.6 style; saves the image into the current directory
                    except:
                        return
                    print(n)
                    n += 1
            else:
                return
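
An alternative sketch (my own, not the author's code) that saves the images without os.chdir: changing the working directory is global state and can surprise other parts of the spider, so building absolute paths is usually safer.

import os
import urllib.request

def save_images(img_urls, folder, prefix, limit=2):
    """Download at most `limit` images into `folder`, named `<prefix>_<n>.jpg`."""
    os.makedirs(folder, exist_ok=True)
    for n, img_url in enumerate(img_urls[:limit], start=1):
        dest = os.path.join(folder, '%s_%s.jpg' % (prefix, n))
        if not os.path.exists(dest):
            try:
                urllib.request.urlretrieve(img_url, dest)
            except Exception:
                continue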

(3) Results

1. The detail info printed to the console

[Screenshot: detail items printed to the console]

2. A folder is created in the local directory to store the data

[Screenshots: the output folder and saved data]

The phone number extraction seems to have an issue; readers can work that out themselves.

Postscript: this write-up is rough; corrections from experienced readers are welcome.

Tool download link 1: https://download.csdn.net/download/u012728971/10476750

Tool download link 2: https://download.csdn.net/download/u012728971/10476780

(The tools are too large for one upload, so they are split into two parts)

Source code download link: https://download.csdn.net/download/u012728971/10476712