Python Self-Study: CrawlSpider Basics

Creating a CrawlSpider:
scrapy genspider -t crawl [spider name] [domain]

Imports:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem
from scrapy.http.response.html import HtmlResponse

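rules: use regular expressions in the rules (for example, in the allow argument of LinkExtractor) to select which URLs are extracted.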
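
To get a feel for how such patterns select URLs, here is a minimal sketch that uses only the standard re module. It assumes that an allow pattern counts as a match when it is found anywhere in the URL; the URLs and both patterns are made up for illustration.

```python
import re

# Hypothetical URLs, as a link extractor might find them on a page.
urls = [
    "http://www.example.com/article-4536-1.html",
    "http://www.example.com/portal.php?mod=list&catid=2&page=2",
    "http://www.example.com/about.html",
]

# Escape literal dots: an unescaped "." matches any character.
detail_pattern = r".+article-.+\.html"
list_pattern = r".+mod=list&catid=2&page=\d+"


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    return [u for u in candidates if re.search(pattern, u)]


print(matching(detail_pattern, urls))  # only the article URL
print(matching(list_pattern, urls))    # only the paginated list URL
```

Checking a pattern against a handful of real URLs from the target site this way, before putting it into a Rule, is a cheap way to catch patterns that are too broad or too narrow.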

allowed_domains: the domains the spider is allowed to crawl.
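
The idea behind this restriction can be sketched in plain Python. This is a simplified model, not Scrapy's actual implementation: it assumes a URL is accepted when its host equals an allowed domain or is a subdomain of one. The function name is_allowed and the domains are hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed_domains = ["example.com"]  # hypothetical domain
print(is_allowed("http://www.example.com/page", allowed_domains))
print(is_allowed("http://other.org/page", allowed_domains))
```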

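follow: set this to True if links found on pages matching the current rule should themselves be followed (e.g. list pages). Setting it to False keeps the spider from crawling similar-looking links found inside the matched pages.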
callback: the name of the function that receives the response for each crawled URL and parses it.
