Learning Scrapy from the Source Code, Part 03: the bench Command and How Commands Are Executed
Code debugging
Continuing from the previous chapter, change the main method in the cmdline module (or the main method called from the __main__ module) to the following:
if __name__ == '__main__':
    execute(['scrapy', 'bench'])
Set a breakpoint on this call, then continue stepping through in debug mode.
1. Get the project settings:
if settings is None:
    settings = get_project_settings()
    # set EDITOR from environment if available
    try:
        editor = os.environ['EDITOR']
    except KeyError:
        pass
    else:
        settings['EDITOR'] = editor
(This first loads Scrapy's built-in default_settings module into memory, then reads the project's scrapy.cfg: the default option under its [settings] section names the project settings module, whose values override the matching keys already loaded from the defaults.
The scrapy.cfg locations searched by default:
['/etc/scrapy.cfg', 'c:\\scrapy\\scrapy.cfg', '/home/wangsir/.config/scrapy.cfg', '/home/wangsir/.scrapy.cfg']
After appending the closest project scrapy.cfg found by walking up from the working directory:
['/etc/scrapy.cfg', 'c:\\scrapy\\scrapy.cfg', '/home/wangsir/.config/scrapy.cfg', '/home/wangsir/.scrapy.cfg', '/home/wangsir/code/sourceWorkSpace/scrapy/scrapy.cfg'])
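If you want to poke at this step yourself, the helpers involved are importable. A minimal sketch, assuming you run it from inside a project directory:

# Hedged sketch: inspect where the settings come from.
from scrapy.utils.conf import get_config            # merges the scrapy.cfg files listed above
from scrapy.utils.project import get_project_settings

cfg = get_config()
# [settings] default = <your_project>.settings  -- the module that overrides the built-in defaults
print(cfg.get('settings', 'default', fallback=None))

settings = get_project_settings()                   # built-in defaults overlaid by the project module
print(settings['BOT_NAME'])                         # project value if set, otherwise 'scrapybot'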
2. Check whether the command is being run from inside a Scrapy project:
inproject = inside_project()
(Later on, each command class's requires_project attribute is compared against this flag to decide whether the command is available, which is why some commands are not listed by scrapy -h when you are outside a project.)
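Roughly speaking, the check boils down to: is a settings module configured, or can a scrapy.cfg be found by walking up from the current directory? A simplified sketch (not the literal source of scrapy.utils.project):

# Simplified sketch of the inside_project() decision.
import os
from scrapy.utils.conf import closest_scrapy_cfg

def inside_project_sketch():
    if os.environ.get('SCRAPY_SETTINGS_MODULE'):   # an explicit settings module counts as "in a project"
        return True
    return bool(closest_scrapy_cfg())              # otherwise: is there a scrapy.cfg above the cwd?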
3. Build the command dictionary (key = command name, value = command instance):
cmds = _get_commands_dict(settings, inproject)
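Conceptually, the dictionary is assembled by walking the scrapy.commands package, instantiating every ScrapyCommand subclass found there, and dropping the ones whose requires_project flag conflicts with where we are. A simplified sketch, not the exact cmdline source (the real code also honours the COMMANDS_MODULE setting):

# Simplified sketch of how the command dict is assembled.
from scrapy.commands import ScrapyCommand
from scrapy.utils.misc import walk_modules

def commands_dict_sketch(inproject):
    cmds = {}
    for module in walk_modules('scrapy.commands'):
        for obj in vars(module).values():
            if (isinstance(obj, type) and issubclass(obj, ScrapyCommand)
                    and obj.__module__ == module.__name__
                    and obj is not ScrapyCommand):
                if inproject or not obj.requires_project:
                    cmds[module.__name__.rsplit('.', 1)[-1]] = obj()   # e.g. {'bench': <Command>, ...}
    return cmds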
4. Pop the current command name out of the arguments:
cmdname = _pop_command_name(argv)
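The idea is simply to take the first non-option token out of argv. A hedged, simplified sketch of the helper:

# Simplified sketch: pop the first argument that is not an option flag.
def pop_command_name_sketch(argv):
    for i, arg in enumerate(argv[1:], start=1):
        if not arg.startswith('-'):
            return argv.pop(i)
    return None

args = ['scrapy', 'bench']
print(pop_command_name_sketch(args), args)   # -> bench ['scrapy']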
5. Create a command-line parser:
parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
                               conflict_handler='resolve')
6. Validate the command name:
if not cmdname:
    _print_commands(settings, inproject)
    sys.exit(0)
elif cmdname not in cmds:
    _print_unknown_command(settings, cmdname, inproject)
    sys.exit(2)
7. Look up the command object:
cmd = cmds[cmdname]
8. Set the usage string:
parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
9. Set the command description:
parser.description = cmd.long_desc()
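syntax() and long_desc() are just small methods that every command can override. A hypothetical command (illustrative only, not part of Scrapy) shows where these strings come from:

# Hypothetical command, only to illustrate the syntax()/desc() hooks used above.
from scrapy.commands import ScrapyCommand

class HelloCommand(ScrapyCommand):
    requires_project = False
    default_settings = {'LOG_ENABLED': False}

    def syntax(self):
        return '[options] <name>'        # rendered as: scrapy hello [options] <name>

    def short_desc(self):
        return 'Say hello to <name>'

    def long_desc(self):
        return self.short_desc()         # the base class does the same by default

    def run(self, args, opts):
        print('hello,', args[0] if args else 'world')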
10. Merge the command's own default_settings into the global settings at 'command' priority, overriding lower-priority duplicates:
class Command(ScrapyCommand):

    default_settings = {
        'LOG_LEVEL': 'INFO',
        'LOGSTATS_INTERVAL': 1,
        'CLOSESPIDER_TIMEOUT': 10,
    }
settings.setdict(cmd.default_settings, priority='command')
cmd.settings = settings
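The 'command' priority sits above the built-in defaults but below project and spider settings, so the bench values win over default_settings without clobbering an explicit project override. A quick illustration of the mechanism (the keys and priorities here are real Scrapy settings; the script itself is just a demo):

# Illustration of setting priorities.
from scrapy.settings import Settings

s = Settings()                                        # starts from the built-in defaults (LOG_LEVEL == 'DEBUG')
s.setdict({'LOG_LEVEL': 'INFO'}, priority='command')  # what step 10 does for bench
print(s['LOG_LEVEL'])                                 # INFO -- the command default beats the built-in default

s.set('LOG_LEVEL', 'WARNING', priority='project')     # a project-level value...
s.setdict({'LOG_LEVEL': 'INFO'}, priority='command')  # ...is NOT overridden by the lower 'command' priority
print(s['LOG_LEVEL'])                                 # WARNING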
11. Add the command's options to the parser:
cmd.add_options(parser)
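add_options() is the usual extension point: the base ScrapyCommand registers the shared options (--logfile, --nolog, -s/--set, --profile, ...), and a concrete command appends its own. A hypothetical example, written against the optparse-based Scrapy version traced here:

# Hypothetical command showing how add_options() extends the shared parser options.
from scrapy.commands import ScrapyCommand

class DumpCommand(ScrapyCommand):                 # illustrative only, not a real Scrapy command
    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)   # keep --logfile, --nolog, -s/--set, --profile, ...
        parser.add_option('--limit', type='int', default=10,
                          help='hypothetical option: how many items to dump')

    def run(self, args, opts):
        print('would dump %d items' % opts.limit)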
12. Parse the arguments and options:
opts, args = parser.parse_args(args=argv[1:])
_run_print_help(parser, cmd.process_options, args, opts)
13. Create the crawler process object and run the command:
cmd.crawler_process = CrawlerProcess(settings)
_run_print_help(parser, _run_command, cmd, args, opts)
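CrawlerProcess is the same public entry point you would use to run a spider from your own script; cmdline simply attaches it to the command. A minimal example (the spider and URL are made up for illustration):

# Running a spider programmatically, mirroring what step 13 does for the command.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class TitleSpider(scrapy.Spider):                     # hypothetical spider
    name = 'title'
    start_urls = ['http://quotes.toscrape.com/']      # example site, just for illustration

    def parse(self, response):
        yield {'title': response.css('title::text').extract_first()}

process = CrawlerProcess(get_project_settings())
process.crawl(TitleSpider)                            # same call shape as crawl(_BenchSpider, total=100000)
process.start()                                       # blocks until the crawl finishes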
14. Exit with the command's exit code:
sys.exit(cmd.exitcode)
The steps above cover, at a high level, how commands are parsed and executed. If you want more detail, see my annotated source on GitHub, where essentially every relevant method is commented.
We're not done yet, though, so put down that knife 0,0. Starting from step 13 above, let's find out what the bench command actually does.
Stepping further in with the debugger, we arrive at the following method.
Running the command
def _run_command(cmd, args, opts):
    if opts.profile:
        _run_command_profiled(cmd, args, opts)
    else:
        cmd.run(args, opts)
(By default opts.profile is not set, so we go straight into run(), i.e. the bench command's run method.)
def run(self, args, opts):
    with _BenchServer():
        self.crawler_process.crawl(_BenchSpider, total=100000)
        self.crawler_process.start()
Let's look at _BenchServer first.
class _BenchServer(object):

    def __enter__(self):
        from scrapy.utils.test import get_testenv
        pargs = [sys.executable, '-u', '-m', 'scrapy.utils.benchserver']
        self.proc = subprocess.Popen(pargs, stdout=subprocess.PIPE,
                                     env=get_testenv())
        self.proc.stdout.readline()

    def __exit__(self, exc_type, exc_value, traceback):
        self.proc.kill()
        self.proc.wait()
        time.sleep(0.2)
(This launches the scrapy.utils.benchserver module in a subprocess.)
That brings us to the scrapy.utils.benchserver module, which starts a Twisted reactor server listening on port 8998 to serve the test HTTP requests:
if __name__ == '__main__':
    root = Root()
    factory = Site(root)
    httpPort = reactor.listenTCP(8998, Site(root))

    def _print_listening():
        httpHost = httpPort.getHost()
        print("Bench server at http://{}:{}".format(httpHost.host, httpHost.port))

    reactor.callWhenRunning(_print_listening)
    reactor.run()
Its render method generates the bench server's page:
def render(self, request):
    total = _getarg(request, b'total', 100, int)
    show = _getarg(request, b'show', 10, int)
    nlist = [random.randint(1, total) for _ in range(show)]
    request.write(b"<html><head></head><body>")
    args = request.args.copy()
    for nl in nlist:
        args['n'] = nl
        argstr = urlencode(args, doseq=True)
        request.write("<a href='/follow?{0}'>follow {1}</a><br>"
                      .format(argstr, nl).encode('utf8'))
    request.write(b"</body></html>")
    return b''
The resulting page is simply a batch of "follow" links, and that is essentially all the bench server's initialization amounts to.
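If you are curious, you can start the server by hand with python -m scrapy.utils.benchserver and fetch the page yourself. A small check, assuming the server is already running locally on port 8998:

# Manual check of the bench server page (server started separately in another terminal).
from urllib.request import urlopen

html = urlopen('http://localhost:8998/?total=100&show=5').read().decode('utf8')
print(html)   # a tiny HTML page containing 5 "follow" links, as render() above produces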
With the server out of the way, let's continue with the rest of run():
self.crawler_process.crawl(_BenchSpider, total=100000)
self.crawler_process.start()
Here we meet _BenchSpider. It is about the simplest spider definition possible, following the same logic as any spider we would normally write:
class _BenchSpider(scrapy.Spider):
    """A spider that follows all links"""
    name = 'follow'
    total = 10000
    show = 20
    baseurl = 'http://localhost:8998'
    link_extractor = LinkExtractor()

    def start_requests(self):
        qargs = {'total': self.total, 'show': self.show}
        url = '{}?{}'.format(self.baseurl, urlencode(qargs, doseq=1))
        return [scrapy.Request(url, dont_filter=True)]

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)
(Its job is to fire a stream of requests at the bench server started earlier, giving a quick check of crawler health: download rate, log volume, queue behaviour, and so on.)
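One small detail: the total=100000 keyword passed to crawl() reaches the spider through Spider.__init__, which copies keyword arguments onto the instance and thus overrides the class-level total = 10000 for this run. A tiny sketch of that behaviour (EchoSpider is hypothetical):

# Sketch of how crawl(..., total=100000) overrides the class attribute on the spider instance.
import scrapy

class EchoSpider(scrapy.Spider):    # hypothetical spider, mirrors _BenchSpider's class attributes
    name = 'echo'
    total = 10000

spider = EchoSpider(total=100000)   # crawler_process.crawl(...) forwards its kwargs here
print(spider.total)                 # 100000 -- the per-run value wins over the class default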
As the code keeps running, the question becomes how this bench spider is scheduled, which touches the core crawling machinery. I'll save that for a later chapter rather than cram too much in at once (honestly, I'm running out of steam and need to go eat).
GitHub address:
https://github.com/wangrenlei/debug_scrapy