Web Crawler Career Guide (Frameworks): The Scrapy Framework (1) -- A Deep Dive into How Scrapy Works
For a crawler engineer, Scrapy is a must-know framework.
What follows is a record of my own learning process, for your reference only!
Scrapy docs: https://doc.scrapy.org/en/latest/topics/architecture.html
Goal: understand Scrapy's architecture and execution flow.
I. Core Architecture
A complete Scrapy setup consists of the following components:
1. Scrapy Engine: coordinates the flow of signals and data between all the other components; it is the core of the framework.
2. Scrapy Scheduler: queues the Requests waiting to be crawled and decides the order (priority) in which they are fetched.
3. Downloader Middlewares: sit between the Engine and the Downloader and apply a series of processing steps, such as setting proxies, request headers, and so on (see the sketch after this list).
4. Downloader: fetches the requested pages and returns the Responses to the Engine, which forwards them to the Spider.
5. Spider Middlewares: process the communication between the Engine and the Spider.
6. Spider: processes each Response, extracting the data it needs (which can be stored in Items) as well as the next URLs to crawl.
7. Item Pipeline: receives the Items extracted by the Spider and processes them (cleaning, validation, storage to a database, and so on).
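For example, item 3 above mentions setting proxies and request headers; below is a minimal downloader-middleware sketch of that idea. The class name, proxy address, and User-Agent value are illustrative placeholders; process_request() and process_response() are Scrapy's standard downloader-middleware hooks.

```python
# middlewares.py -- a minimal downloader middleware sketch.
# The class name, proxy address, and User-Agent value are placeholders.

class CustomProxyMiddleware:
    def process_request(self, request, spider):
        # Called for every Request on its way from the Engine to the Downloader.
        request.meta['proxy'] = 'http://127.0.0.1:8888'  # hypothetical proxy
        request.headers.setdefault('User-Agent', 'Mozilla/5.0')
        return None  # None => let the Request continue down the chain

    def process_response(self, request, response, spider):
        # Called for every Response on its way back to the Engine.
        return response
```

To enable it, register the class path under DOWNLOADER_MIDDLEWARES in settings.py, e.g. {'myproject.middlewares.CustomProxyMiddleware': 543}; the number controls its position in the middleware chain ('myproject' is a placeholder project name).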
II. Execution Flow
The Scrapy documentation illustrates this flow with an architecture diagram (see the link above).
The original wording from the documentation:
The data flow in Scrapy is controlled by the execution engine, and goes like this:
- The Engine gets the initial Requests to crawl from the Spider.
- The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
- The Scheduler returns the next Requests to the Engine.
- The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
- Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
- The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
- The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
- The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
- The process repeats (from step 1) until there are no more requests from the Scheduler.
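As a reference for the process_spider_input() and process_spider_output() hooks named above, here is a minimal spider-middleware sketch. The class name and log message are illustrative; the hook signatures are Scrapy's standard ones.

```python
# middlewares.py -- a minimal spider middleware sketch.

class LoggingSpiderMiddleware:
    def process_spider_input(self, response, spider):
        # Called for each Response before it reaches the Spider.
        spider.logger.debug('Response arrived: %s', response.url)
        return None  # None => hand the Response on to the Spider

    def process_spider_output(self, response, result, spider):
        # Called with everything the Spider yields (Items and new Requests).
        for item_or_request in result:
            yield item_or_request  # pass everything through unchanged
```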
All data flow is coordinated and driven by the Engine:
1. The Spider first passes the initial Requests to the Engine.
2. The Engine passes the Requests to the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests, in priority order, to the Engine.
4. The Engine sends the Requests, one by one, to the Downloader (optionally processed by the Downloader Middlewares on the way).
5. Once a page finishes downloading, the Downloader wraps it in a Response and sends the Response back to the Engine.
6. The Engine receives the Response and passes it to the Spider (optionally through the Spider Middlewares).
7. The Spider extracts the Items it needs (plus any new Requests to follow) and returns them to the Engine along the same path.
8. The Engine sends the Items to the Item Pipeline (which may store them locally or in a database), completing one cycle, and asks for the next one.
9. The loop repeats from step 1 until the Scheduler has no more Requests!
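To tie the whole loop together, here is a minimal end-to-end sketch of a spider plus an item pipeline. The target site (quotes.toscrape.com), the CSS selectors, and the class names are illustrative assumptions, not something from this article:

```python
# quotes_spider.py -- a minimal sketch of the data flow above.
# The target site and selectors are illustrative.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']  # step 1: initial Requests

    def parse(self, response):
        # Step 7: extract Items from the Response...
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # ...and yield new Requests to follow; the Engine hands them
        # back to the Scheduler (steps 8-9).
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

```python
# pipelines.py -- step 8: the Engine hands each Item to the pipeline
# (enable it under ITEM_PIPELINES in settings.py).

class StripTextPipeline:
    def process_item(self, item, spider):
        item['text'] = item['text'].strip()  # a trivial cleaning step
        return item
```

Run it with scrapy crawl quotes from inside a Scrapy project; the Engine drives all nine steps automatically.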