Teaching Yourself Python: CrawlSpider Basics

Creating a CrawlSpider:
scrapy genspider -t crawl [spider name] [domain]
Imports:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.response.html import HtmlResponse  # handy for type hints in callbacks
from wxapp.items import WxappItem  # item class defined in the project's items.py

The `allow` patterns used in `rules` should be regular expressions.

allowed_domains: the range of domains the spider is allowed to crawl.

follow: if URLs matching the current rule should themselves be followed for further crawling, set this to True (e.g., when crawling paginated list pages). Setting it to False stops the spider from re-extracting similar links that appear on the matched pages.
callback: the function that each matched URL's response is passed to for parsing.