Using Python's Scrapy Framework to Crawl Data and Save It to a CSV File (Python Crawler in Practice 4)

1. The Scrapy Framework

Scrapy is a crawler framework for Python that combines data parsing, data processing, and data storage in a single package.

2. Installing Scrapy

1. Install the dependency packages

yum install gcc libffi-devel python-devel openssl-devel -y

yum install libxslt-devel -y

2. Install Scrapy

pip install scrapy

pip install twisted==13.1.0

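Note: Scrapy and Twisted can be incompatible. In the environment used here (Scrapy 1.5.0 on Python 2.7.5), a Twisted version that is too new makes scrapy startproject project_name fail with an error; installing twisted==13.1.0 resolves it.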

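3. Crawling Data with Scrapy and Saving It to CSV

3.1. Crawl target: the data for the hot collections (专题) on Jianshu. The site is https://www.jianshu.com/recommendations/collections, and the "热门" (Hot) tab is the page to crawl. The page loads its content asynchronously via AJAX. Opening the developer tools with F12, selecting Network and then XHR, and paging through the list reveals the page URL https://www.jianshu.com/recommendations/collections?page=2&order_by=hot. Changing the number after page= gives access to the other pages.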
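The paging scheme described above can be sketched in a few lines of Python. This is only an illustration: the helper name build_page_urls and the page range are assumptions made for the example, while the URL pattern is the one observed in the XHR panel.

```python
# URL pattern observed in the browser's XHR panel; only the page number varies.
URL_TEMPLATE = "https://www.jianshu.com/recommendations/collections?page={}&order_by=hot"


def build_page_urls(first_page, last_page):
    """Return the listing URLs for first_page..last_page, inclusive (hypothetical helper)."""
    return [URL_TEMPLATE.format(page) for page in range(first_page, last_page + 1)]


page_urls = build_page_urls(1, 3)
for url in page_urls:
    print(url)
```

Keeping the pattern in one template string means that only the page number has to change if more pages are needed.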


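3.2. Content to crawl

The fields to crawl for each collection are its name, its description, the number of articles it contains, and the number of followers. Scrapy uses XPath to pick out the required data. While writing the spider, the expressions can first be tried by hand with the xpath support in lxml and, once they are confirmed to work, copied into the Scrapy code. The difference is that in Scrapy the extract() function has to be called to get the data out of the selector.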
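To illustrate the idea of checking path expressions by hand before putting them into the spider, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages; ElementTree supports only a subset of XPath (no text() step, for example), and the markup below is a simplified, hypothetical fragment rather than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<html><body>
  <div class="col-xs-8">
    <div>
      <a href="#"><h4> Sample collection </h4><p> A short description </p></a>
    </div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Same outer expression as in the spider: every div with class "col-xs-8".
blocks = root.findall(".//div[@class='col-xs-8']")

# ElementTree has no text() step, so read .text from the matched element instead.
names = [block.find("div/a/h4").text.strip() for block in blocks]
descriptions = [block.find("div/a/p").text.strip() for block in blocks]
print(names, descriptions)
```

With lxml or Scrapy the relative paths would end in /text(), and in Scrapy the result would additionally go through extract().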

3.3 Creating the crawler project

[[email protected] jianshu_hot_topic]# scrapy startproject jianshu_hot_topic

# The project directory structure is as follows:
[[email protected] python]# tree jianshu_hot_topic
jianshu_hot_topic
├── jianshu_hot_topic
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── pipelines.pyc
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── collection.py
│       ├── collection.pyc
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── jianshu_hot_topic_spider.py    # created by hand; contains the data extraction logic
│       └── jianshu_hot_topic_spider.pyc
└── scrapy.cfg

2 directories, 16 files
[[email protected] python]#

3.4 Code

1. items.py, which defines the data fields to be crawled

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class JianshuHotTopicItem(scrapy.Item):
    '''
    Inherits the attributes and methods of scrapy.Item and declares
    the fields to be crawled for each collection.
    '''
    collection_name = Field()             # collection name
    collection_description = Field()      # collection description
    collection_article_count = Field()    # number of articles in the collection
    collection_attention_count = Field()  # number of followers

 

2. spiders/jianshu_hot_topic_spider.py, which implements the data extraction logic with XPath

[[email protected] jianshu_hot_topic]# cat spiders/jianshu_hot_topic_spider.py

# -*- coding: utf-8 -*-

import random
from time import sleep

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

from jianshu_hot_topic.items import JianshuHotTopicItem


class jianshu_hot_topic(CrawlSpider):
    '''
    Crawls Jianshu collection data and extracts specific fields from each listing page.
    '''
    name = "jianshu_hot_topic"
    start_urls = ["https://www.jianshu.com/recommendations/collections?page=2&order_by=hot"]

    def parse(self, response):
        '''
        @params: response, the page from which the fields are extracted
        '''
        selector = Selector(response)
        collections = selector.xpath('//div[@class="col-xs-8"]')
        for collection in collections:
            # Create a fresh item per collection so yielded items do not share state.
            item = JianshuHotTopicItem()
            # extract() returns a list of strings; take the first match and trim whitespace.
            item['collection_name'] = collection.xpath('div/a/h4/text()').extract()[0].strip()
            item['collection_description'] = collection.xpath('div/a/p/text()').extract()[0].strip()
            # Drop the unit labels so that only the numbers remain.
            item['collection_article_count'] = collection.xpath('div/div/a/text()').extract()[0].strip().replace(u'篇文章', u'')
            item['collection_attention_count'] = collection.xpath('div/div/text()').extract()[0].strip().replace(u'人关注', u'').replace(u'· ', u'')
            yield item

        # Queue pages 3 to 10; Scrapy's default duplicate filter drops URLs that were already requested.
        urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(str(i)) for i in range(3, 11)]
        for url in urls:
            # Random pause of 2 to 7 seconds; note that sleep() blocks the whole crawler while it waits.
            sleep(random.randint(2, 7))
            yield Request(url, callback=self.parse)
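The string cleanup applied to the two count fields can be tried in isolation. The sketch below is not part of the project: clean_count is a hypothetical helper, and the raw strings are made-up samples shaped like the text the spider is expected to extract.

```python
# -*- coding: utf-8 -*-


def clean_count(raw, suffix):
    """Trim whitespace, then remove the "· " separator and the unit label (hypothetical helper)."""
    return raw.strip().replace(u"· ", u"").replace(suffix, u"")


# Made-up raw strings shaped like the extracted text nodes.
article_count = clean_count(u" 120篇文章 ", u"篇文章")
attention_count = clean_count(u" · 3400人关注 ", u"人关注")
print(article_count, attention_count)
```

The results are still strings; converting them with int() would be a further step if numeric values are needed in the CSV.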

 

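3. pipelines.py, which defines how the data is stored. The storage logic goes here, and the data can be written to a MySQL database, a MongoDB database, a plain file, CSV, Excel, or another storage medium. The following example stores the data in a CSV file: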

[[email protected] jianshu_hot_topic]# cat pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv


class JianshuHotTopicPipeline(object):
    def process_item(self, item, spider):
        # Append one row per item; the with block closes the file after each write.
        with open('/root/zhuanti.csv', 'a+') as f:
            writer = csv.writer(f)
            writer.writerow((item['collection_name'], item['collection_description'], item['collection_article_count'], item['collection_attention_count']))
        return item
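As a possible variation, the file can be opened once per crawl rather than once per item by using the open_spider and close_spider hooks that Scrapy calls on item pipelines. The sketch below assumes Python 3 (for the encoding and newline arguments of open). The class name CsvExportPipeline and its path parameter are invented for this example; in a real project the path would have to be hard-coded or supplied some other way, because Scrapy does not pass such an argument by default. The pipeline is exercised here outside Scrapy, with a plain dict standing in for an item.

```python
import csv
import os
import tempfile


class CsvExportPipeline(object):
    """Hypothetical variant: opens the CSV once per crawl instead of once per item."""

    FIELDS = ("collection_name", "collection_description",
              "collection_article_count", "collection_attention_count")

    def __init__(self, path):
        self.path = path  # output location; a parameter invented for this sketch

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings; utf-8 handles Chinese text.
        self.file = open(self.path, "a", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow([item[name] for name in self.FIELDS])
        return item


# Exercise the pipeline outside Scrapy; the values are made-up sample data.
csv_path = os.path.join(tempfile.mkdtemp(), "zhuanti.csv")
pipeline = CsvExportPipeline(csv_path)
pipeline.open_spider(spider=None)
pipeline.process_item({
    "collection_name": "示例专题",
    "collection_description": "描述",
    "collection_article_count": "120",
    "collection_attention_count": "3400",
}, spider=None)
pipeline.close_spider(spider=None)

with open(csv_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Passing an explicit encoding also means the Chinese text is written as UTF-8 regardless of the interpreter's default encoding.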

 

4. Modify the settings file to enable the pipeline:

ITEM_PIPELINES = {
    'jianshu_hot_topic.pipelines.JianshuHotTopicPipeline': 300,
}
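The same settings file could also take over request throttling, as a possible alternative to calling sleep() inside the spider. The fragment below is only a sketch: DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are standard Scrapy settings, but the value of 4 seconds is an assumption chosen for illustration and is not taken from the original project.

```python
# Hypothetical additions to settings.py (the delay value is an assumption).
DOWNLOAD_DELAY = 4               # base delay in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # Scrapy then waits between 0.5x and 1.5x of DOWNLOAD_DELAY

# Resulting range of the actual wait, shown for clarity only.
delay_range = (0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY)
print(delay_range)
```

Unlike sleep(), these settings let Scrapy schedule the waiting itself, so the rest of the crawler is not blocked during the pause.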

3.5 Running the Scrapy spider

Go back to the directory where the Scrapy project was created and run scrapy crawl spider_name, as follows:

[[email protected] jianshu_hot_topic]# pwd

/root/python/jianshu_hot_topic

[[email protected] jianshu_hot_topic]# scrapy crawl jianshu_hot_topic

2018-02-24 19:12:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: jianshu_hot_topic)

2018-02-24 19:12:23 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 13.1.0, Python 2.7.5 (default, Aug  4 2017, 00:39:18) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)], pyOpenSSL 0.13.1 (OpenSSL 1.0.1e-fips 11 Feb 2013), cryptography 1.7.2, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core

2018-02-24 19:12:23 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu_hot_topic.spiders', 'SPIDER_MODULES': ['jianshu_hot_topic.spiders'], 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0', 'BOT_NAME': 'jianshu_hot_topic'}

Check /root/zhuanti.csv; if it contains the crawled data, the crawl has worked.

 

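4. Summary of problems encountered

1. Twisted version incompatibility, caused by installing a version that is too new. Installing Twisted (13.1.0) fixes it.

2. Chinese data cannot be written and an 'ascii' error is reported. In the Python 2.7 environment used here, setting Python's default encoding to UTF-8 fixes it, as follows: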

>>> import sys

>>> sys.getdefaultencoding()

'ascii'

>>> reload(sys)

<module 'sys' (built-in)>

>>> sys.setdefaultencoding('utf8')

>>> sys.getdefaultencoding()

'utf8'
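The reason for the 'ascii' error can be reproduced in a few lines: Chinese characters have no representation in the ASCII codec, so any implicit conversion through it fails, while an explicit UTF-8 encode succeeds. This is a minimal sketch with a made-up sample string; encoding explicitly is shown here as one possible alternative to changing the interpreter-wide default, not as the method used in this project.

```python
# -*- coding: utf-8 -*-
text = u"热门专题"  # made-up sample value

# Encoding through the ASCII codec fails for Chinese characters.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly as UTF-8 works without touching the default encoding.
utf8_bytes = text.encode("utf-8")
print(ascii_ok, len(utf8_bytes))
```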

3. The spider cannot fetch data from the site. This is caused by the request headers; add a USER_AGENT variable to settings.py, for example:

USER_AGENT="Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"

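When using Scrapy, a run may fail or produce results that do not match expectations. The logs it displays are very detailed; reading them carefully, together with the code and an online search, is usually enough to resolve the problem.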