Python imports and running Scrapy from PyCharm
Continuing from the previous day's Scrapy crawler post.
Section1 Extracting data with XPath
tiebaSpider.py now pulls the post blocks out of the page with XPath:

import scrapy
from mySpiderOne.mySpiderOne.items import MyspideroneItem


class TiebaspiderSpider(scrapy.Spider):
    name = 'tiebaSpider'
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1']

    def parse(self, response):
        # filename = "tieba.html"
        # open(filename, "wb+").write(response.body)
        items = []
        for each in response.xpath("//li[@class=' j_thread_list clearfix']//div[@class='threadlist_lz clearfix']"):
            print(each.extract())
        # wrap the extracted data in an `Item` object
        return items
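For now the loop only prints the raw HTML of each matched block. Once the spider runs, the data would be copied into the Item roughly like this. This is only a sketch: the title and author field names and the inner XPaths are my assumptions, so adjust them to whatever your items.py and the actual page markup define.

    def parse(self, response):
        items = []
        for each in response.xpath("//li[@class=' j_thread_list clearfix']"
                                   "//div[@class='threadlist_lz clearfix']"):
            item = MyspideroneItem()
            # extract_first() returns None instead of raising when nothing matches
            item['title'] = each.xpath(".//a/text()").extract_first()    # assumed field/XPath
            item['author'] = each.xpath(".//span/text()").extract_first()  # assumed field/XPath
            items.append(item)
        return items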
Run the spider from the command line:

scrapy crawl tiebaSpider
C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne>scrapy crawl tiebaSpider
Traceback (most recent call last):
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\Scripts\scrapy.exe\__main__.py", line 9, in <module>
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\cmdline.py", line 148, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 243, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 134, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 330, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne\spiders\tiebaSpider.py", line 4, in <module>
    from mySpiderOne.mySpiderOne.items import MyspideroneItem
ModuleNotFoundError: No module named 'mySpiderOne.mySpiderOne'
Section2 Fixing the import
The traceback says the module mySpiderOne.mySpiderOne cannot be found. When scrapy crawl runs, the project package is already importable as plain mySpiderOne, so the doubled path does not exist. Let's change the import to a relative one:
from ..items import MyspideroneItem
Here, one leading dot refers to the current package, and each additional dot goes one level up.
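To make that concrete, here is the layout that scrapy startproject generates (a sketch; your paths may differ) and what the dots resolve to from inside spiders/tiebaSpider.py:

# Project layout (assumed, as created by `scrapy startproject mySpiderOne`):
#
#   mySpiderOne/            <- project root, holds scrapy.cfg
#       mySpiderOne/        <- the Python package Scrapy imports
#           items.py
#           spiders/
#               tiebaSpider.py
#
# Inside spiders/tiebaSpider.py, `.` is the spiders package itself,
# and `..` is the package one level up, where items.py lives:
from ..items import MyspideroneItem  # resolves to mySpiderOne/items.py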
Run scrapy crawl tiebaSpider again.
This time we finally get data:
2017-08-23 22:54:20 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-23 22:54:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 532,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 52996,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 23, 14, 54, 20, 98239),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 23, 14, 54, 18, 103125)}
2017-08-23 22:54:20 [scrapy.core.engine] INFO: Spider closed (finished)
Section3 Running the spider directly from PyCharm
Starting the spider from the command line every time is a bit tedious, so let's set up PyCharm so future runs only need a click of Run. Create a new start.py at the project root (the directory that holds scrapy.cfg, so Scrapy can locate the project settings).
Its content is:
from scrapy import cmdline

cmdline.execute("scrapy crawl tiebaSpider".split())
Now you can run the crawler simply by clicking the triangular Run button.
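If you would rather not go through the scrapy command at all, Scrapy also lets you drive the crawl in-process with CrawlerProcess. A minimal sketch, assuming the script is run from the project root so get_project_settings() can find scrapy.cfg:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project settings (requires the working directory to be
# inside the Scrapy project) and run the spider by its name.
process = CrawlerProcess(get_project_settings())
process.crawl('tiebaSpider')  # same name as used with `scrapy crawl`
process.start()               # blocks here until the crawl finishes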