Scrapy Study Notes (Misc. 1)

Category: Articles • 2025-02-09 09:38:10

Scrapy workflow:

Module functions:

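Handling data from different websites in a large crawler project:

First define a separate spider per site (spider1, spider2, spider3, ...). Each spider carries its own attributes: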

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['test.com']
    start_urls = ['http://test.com/']

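In a pipeline, process_item(self, item, spider) receives the spider that produced the item as its spider argument.
Use that argument to choose a different processing path for each spider's data.

Example: either a single pipeline checks the spider name and branches, or several pipelines each handle one spider's data: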

# Option 1: a single pipeline that branches on the spider name
class MyspiderPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'spider1':
            pass  # TODO: handle items from spider1
        return item

# Option 2: one pipeline per spider
class MyspiderPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'spider1':
            pass  # TODO: handle items from spider1
        return item

class MyspiderPipeline2(object):
    def process_item(self, item, spider):
        if spider.name == 'spider2':
            pass  # TODO: handle items from spider2
        return item
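When the number of spiders grows, the if chain in the single-pipeline variant can be replaced by a lookup table keyed on spider.name. The following is only a minimal sketch of that idea: DispatchPipeline, handle_spider1, handle_spider2 and the "source" field are hypothetical names used for illustration, not part of the original notes or of Scrapy.

```python
def handle_spider1(item):
    # Hypothetical processing for items produced by spider1.
    item["source"] = "spider1"
    return item


def handle_spider2(item):
    # Hypothetical processing for items produced by spider2.
    item["source"] = "spider2"
    return item


class DispatchPipeline:
    # Maps a spider name to the function that processes its items.
    handlers = {
        "spider1": handle_spider1,
        "spider2": handle_spider2,
    }

    def process_item(self, item, spider):
        handler = self.handlers.get(spider.name)
        if handler is not None:
            item = handler(item)
        # Items from unknown spiders pass through unchanged.
        return item
```

Adding a new spider then only means adding one entry to handlers, and process_item itself stays untouched.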

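Note: every pipeline must be registered in settings.py (ITEM_PIPELINES) and given a priority value. The value works like a distance: the pipeline with the smaller value is closer and runs first.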
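As a rough illustration of that setting, the fragment below registers the two pipelines from the example. The package name "myspider" is an assumption (use your own project package), and the values 300 and 400 are arbitrary; by convention they are chosen in the 0 to 1000 range.

```python
# settings.py (sketch; "myspider" is an assumed project package name)
ITEM_PIPELINES = {
    "myspider.pipelines.MyspiderPipeline": 300,   # smaller value: runs first
    "myspider.pipelines.MyspiderPipeline2": 400,  # larger value: runs later
}
```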

Example: open_spider and close_spider can be used to measure how long a spider runs.
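A minimal sketch of such a timing pipeline might look like the following. The class name TimingPipeline and the attributes started_at and elapsed are assumptions made for illustration; open_spider, close_spider, process_item and spider.logger are the standard Scrapy hooks and attributes.

```python
import time


class TimingPipeline:
    def open_spider(self, spider):
        # Called once when the spider is opened.
        self.started_at = time.monotonic()

    def close_spider(self, spider):
        # Called once when the spider is closed.
        self.elapsed = time.monotonic() - self.started_at
        spider.logger.info(
            "Spider %s ran for %.2f seconds", spider.name, self.elapsed
        )

    def process_item(self, item, spider):
        # This pipeline only measures time; items pass through unchanged.
        return item
```

time.monotonic() is used instead of time.time() so that the measurement is not affected by system clock adjustments during the run.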

Shared variables and common settings

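Returning the item from process_item passes it on to the next pipeline, that is, the one with the next larger (more distant) priority value.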
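The hand-off can be pictured with a small stand-alone sketch. Here run_chain is a hypothetical stand-in for what Scrapy does internally (it is not a Scrapy function), and CleanPipeline, StorePipeline and the "title" field are assumed names used only for illustration.

```python
class CleanPipeline:
    def process_item(self, item, spider):
        # Normalize the field, then hand the item to the next pipeline.
        item["title"] = item["title"].strip()
        return item


class StorePipeline:
    def __init__(self):
        self.rows = []

    def process_item(self, item, spider):
        # Receives whatever the previous pipeline returned.
        self.rows.append(dict(item))
        return item


def run_chain(pipelines, item, spider=None):
    """Hypothetical helper: call pipelines in ascending priority order."""
    for _, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
        item = pipeline.process_item(item, spider)
    return item


store = StorePipeline()
result = run_chain([(400, store), (300, CleanPipeline())], {"title": "  hello  "})
```

Because CleanPipeline has the smaller value (300), it runs first, and StorePipeline (400) stores the already cleaned item.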

Using the Files Pipeline
The files pipeline is usually set up with the following steps:
1) Enable FilesPipeline in the settings.py configuration file.

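ITEM_PIPELINES = {
    # Scrapy's built-in class lives at 'scrapy.pipelines.files.FilesPipeline';
    # a 'tutorial.…' path refers to a class defined in your own project.
    'tutorial.pipelines.files.FilesPipeline': 1,
}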

2) Set FILES_STORE in settings.py to the directory where downloaded files are stored.

# File storage path
FILES_STORE = '/Users/huangtao/Downloads/files'

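3) Implement ExampleItem (optional): define the two fields file_urls and files in items.py.

from scrapy import Item, Field

class ExampleItem(Item):
    file_urls = Field()  # URLs to download
    files = Field()      # filled in by the pipeline with the download results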

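4) Implement ExamplesSpider and set the starting URL.
The parse method extracts the file download URLs and returns them; usually these URLs are assigned to the file_urls field of ExampleItem.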

# Fetch image information from 360 (image.so.com)
import scrapy

class SoSpider(scrapy.Spider):
    name = "so"
    allowed_domains = ["image.so.com"]

    def __init__(self, *args, **kwargs):
        super(SoSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://image.so.com/z?ch=go']

    # parse extracts the file download URLs and returns them; usually the
    # URLs are assigned to the file_urls field of ExampleItem.
    def parse(self, response):
        ...
        yield ExampleItem(file_urls=[url])

Note: when downloading non-text files, file_urls must be a list, even for a single URL. Otherwise Scrapy raises ValueError: Missing scheme in request url: h
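The trailing "h" in that error message is a useful hint: the pipeline iterates over file_urls and builds one request per element, so a plain string is walked character by character and the first "URL" it sees is just "h". The sketch below only imitates that behaviour with the standard library; iter_download_urls and as_url_list are hypothetical helper names, not Scrapy APIs, and the example.com URL is a placeholder.

```python
from urllib.parse import urlparse


def iter_download_urls(file_urls):
    """Imitate a pipeline that makes one request per element of file_urls."""
    for url in file_urls:
        if not urlparse(url).scheme:
            raise ValueError(f"Missing scheme in request url: {url}")
        yield url


def as_url_list(value):
    """Wrap a single URL string in a list; leave other iterables as lists."""
    return [value] if isinstance(value, str) else list(value)


single = "http://example.com/a.jpg"  # placeholder URL

# Correct: a list with one element yields the whole URL.
good = list(iter_download_urls(as_url_list(single)))

# Incorrect: a bare string is iterated character by character.
try:
    list(iter_download_urls(single))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Wrapping the value with a helper such as as_url_list before assigning it to file_urls is one simple way to avoid the mistake.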