Python web scraping script: following a tutorial and running into problems

Problem description:

I am following a tutorial on YouTube: Scraping Web Pages with Scrapy

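The tutorial is old and was written for Python 2.x, while I am learning with version 3.x. So far I have run into a few problems that I was able to solve by searching Google. Right now, however, I am getting this error: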

  File "/usr/lib64/python3.5/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/skeer/PycharmProjects/scrape_craigslists/scrape_cl/scrape_cl/spiders/scrape.py", line 11, in parse
    xpath = scrapy.selector(response)
TypeError: 'module' object is not callable

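An earlier Google search turned up posts saying this is caused by a lowercase character, i.e. the "S" in selector should be capitalized. I tried that and was greeted with an error saying the scrapy.Selector module could not be found.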

Here is my code:

from scrapy.spider import Spider
import scrapy.selector


class MySpider(Spider):
    name = "craigslist"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://helena.craigslist.org/search/sad"]

    def parse(self, response):
        xpath = scrapy.selector(response)
        titles = xpath.select("//p")
        for titles in titles:
            title = xpath("/body/section/form/div/li/p[@class]()").extract()
            link = xpath("/body/section/form/div/ul/li/a[@href]").extract()
            print(title, link)

scrapy.selector is the module that contains the Selector class, and a module cannot be called. Try

from scrapy.selector import Selector 
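The same error can be reproduced without Scrapy. The sketch below uses the standard library's fractions module and its Fraction class as a stand-in for scrapy.selector and Selector; that pairing is only an analogy chosen for illustration, not anything Scrapy-specific.

```python
import fractions  # a module, playing the role of scrapy.selector

try:
    # Calling the module itself, like scrapy.selector(response) in the question
    fractions(1, 2)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Importing the class that lives inside the module, like Selector
from fractions import Fraction

half = Fraction(1, 2)  # calling the class works
print(half)
```

The lowercase name refers to the module and the capitalized name to the class inside it, which is why only the second call succeeds.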

However, this is unnecessary, because the response object already has a selector interface and an xpath method, so you should do:

def parse(self, response):
    xpath = response.xpath
    titles = xpath("//p")
    for titles in titles:
        title = xpath("/body/section/form/div/li/p[@class]()").extract()
        link = xpath("/body/section/form/div/ul/li/a[@href]").extract()
        print(title, link)

Also, you will need a very good list of proxies if you are planning to scrape craigslist. They ban IPs quickly, specifically to prevent scraping.
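One way a proxy list could be put to use is a small downloader middleware that assigns proxies round-robin. This is only a sketch under stated assumptions: the class name RotatingProxyMiddleware, the proxy URLs, and the FakeRequest stand-in are hypothetical, and the process_request(request, spider) signature follows the downloader middleware interface as documented for many Scrapy versions. Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"].

```python
from itertools import cycle


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware that hands out proxies round-robin."""

    def __init__(self, proxies):
        self._proxies = cycle(proxies)

    def process_request(self, request, spider):
        # HttpProxyMiddleware picks the proxy up from request.meta["proxy"].
        request.meta["proxy"] = next(self._proxies)
        return None  # let the request continue through the middleware chain


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self):
        self.meta = {}


# Placeholder proxy addresses; a real list would come from a proxy provider.
middleware = RotatingProxyMiddleware(
    ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]
)

assigned = []
for _ in range(3):
    request = FakeRequest()
    middleware.process_request(request, spider=None)
    assigned.append(request.meta["proxy"])
print(assigned)
```

In a real project such a class would have to be enabled through the DOWNLOADER_MIDDLEWARES setting, and a site's terms of use should be checked before scraping it at all.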

I would recommend learning from the official docs; there are also curated resources.

For your problem, check the official docs for Scrapy Selectors.

from scrapy.selector import Selector

class MySpider(Spider):
    ...
    def parse(self, response):
        xpath = Selector(response)
        ...

Change the function definition:

def parse(self, response):
    xpath = scrapy.selector.Selector(response)
    # .select() is the old, deprecated name for .xpath()
    titles = xpath.xpath("//p")
    for titles in titles:
        title = xpath.xpath("/body/section/form/div/li/p[@class]()").extract()
        link = xpath.xpath("/body/section/form/div/ul/li/a[@href]").extract()
        print(title, link)

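In other words, replace xpath("/body/section/form/div/li/p[@class]()") with xpath.xpath("/body/section/form/div/li/p[@class]()"). A Selector instance is not callable, so queries have to go through its xpath method.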