Scrapy - parse_item is not being called
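Question:
I have two main problems:
1) The parse_item method is not called after a page is crawled when callback='self.parse_item' is included in the rule.
2) Scrapy does not continue to follow links. Instead, it only follows links that are immediately available from the start URLs.
Here is the code: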
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from sheprime.items import SheprimeItem

class HerroomSpider(CrawlSpider):
    name = "herroom"
    allowed_domains = ["herroom.com"]
    start_urls = [
        "http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml",
        "http://www.herroom.com/hosiery.aspx",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml',)), callback='self.parse_item')
    ]

    def parse_item(self, response):
        # I have put in this simple parse function, because I just want to get it to work
        print "some message"
Thanks for your help,
L
Answer
Your code:
Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml',)), callback='self.parse_item')
It should be:
Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml',)), callback='parse_item')
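The reason the string matters: a string callback is looked up as an attribute name on the spider instance, so 'self.parse_item' is searched for as a literal attribute called "self.parse_item", which does not exist, and the rule silently ends up with no callback. Below is a simplified, Scrapy-free sketch of that lookup. FakeSpider and resolve_callback are made-up names for illustration only, not Scrapy's own code.

```python
class FakeSpider(object):
    """Hypothetical stand-in for a spider; not a real Scrapy class."""

    def parse_item(self, response):
        return "parsed %s" % response


def resolve_callback(spider, callback):
    """Roughly mimic how a rule's callback is turned into a method.

    Callables are used as-is; strings are looked up by attribute name,
    and an unknown name silently yields None (so nothing gets called).
    """
    if callable(callback):
        return callback
    return getattr(spider, callback, None)


spider = FakeSpider()
good = resolve_callback(spider, "parse_item")       # bound method
bad = resolve_callback(spider, "self.parse_item")   # no such attribute

print(good("page"))  # parsed page
print(bad)           # None
```

No exception is raised in the failing case, which is why the spider runs normally but never prints anything.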
This works for me:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class HerroomSpider(CrawlSpider):
    name = "herroom"
    allowed_domains = ["herroom.com"]
    start_urls = [
        "http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml",
        "http://www.herroom.com/hosiery.aspx"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml',)), callback='parse_item')
    ]

    def parse_item(self, response):
        print "some message"
Result:
~/projects/test$ scrapy crawl herroom
2012-07-09 08:08:51+0400 [scrapy] INFO: Scrapy 0.15.1 started (bot: domains_scraper)
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled extensions: LogStats, CloseSpider, CoreStats, SpiderState
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled item pipelines: Pipeline
2012-07-09 08:08:51+0400 [herroom] INFO: Spider opened
2012-07-09 08:08:51+0400 [herroom] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-09 08:08:52+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml> (referer: None)
2012-07-09 08:08:54+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/hosiery.aspx> (referer: None)
2012-07-09 08:08:55+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele.shtml> (referer: http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml)
some message
2012-07-09 08:08:56+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele-12p300-trocadero-strapless-bra.shtml> (referer: http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml)
some message
2012-07-09 08:08:57+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele-12p342-trocadero-push-up-bra-with-racerback.shtml> (referer: http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml)
some message
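In the log above, every followed page has a start URL as its referer. That matches the documented default of Rule: when a callback is given, follow defaults to False, so links on the matched pages are not extracted any further unless follow=True is passed. A tiny sketch of that default follows; effective_follow is a hypothetical helper written for illustration, not part of Scrapy.

```python
def effective_follow(callback=None, follow=None):
    """Hypothetical helper mirroring the documented default of Rule's follow flag."""
    if follow is None:
        # No explicit choice: keep following only when there is no callback.
        return not callback
    return follow


print(effective_follow(callback="parse_item"))               # False
print(effective_follow(callback="parse_item", follow=True))  # True
print(effective_follow())                                    # True
```

If deeper crawling is wanted, passing follow=True alongside callback='parse_item' in the Rule is the usual way to get both the callback and continued link extraction.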
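I tried both options, but couldn't get either of them to print any message. Does it work for you? Maybe there is another problem I'm missing. – RunwithLuke 2012-07-08 20:24:15
@RunwithLuke, see my updated answer. The second version I suggested doesn't work because 'self' is not available in the class body, but the first option should work for you. – warvariuc 2012-07-09 04:12:45
Thanks for your help. After hours of confusion, it turned out to be an indentation problem! Thanks again for your help – RunwithLuke 2012-07-09 18:23:03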
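The thread does not say what the indentation mistake actually was. One plausible variant, shown here purely as an assumption, is a parse_item that ends up outside the class body: the name lookup on the spider then finds nothing, and no error is raised. IndentedSpider and DedentedSpider are hypothetical names, and the sketch needs no Scrapy install.

```python
class IndentedSpider(object):
    name = "indented"

    def parse_item(self, response):  # inside the class: a real method
        return "some message"


class DedentedSpider(object):
    name = "dedented"


def parse_item(self, response):  # one level too far left: a module-level function
    return "some message"


# A string callback is looked up by name on the spider instance.
found_indented = getattr(IndentedSpider(), "parse_item", None) is not None
found_dedented = getattr(DedentedSpider(), "parse_item", None) is not None

print(found_indented)  # True
print(found_dedented)  # False
```

Both versions are valid Python, so the interpreter gives no hint; checking the attribute on an instance, as above, is a quick way to confirm the method is where the rule expects it.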