Unable to scrape news titles from Hacker News

Question:

I just want to scrape the titles and links of the latest news stories on Hacker News.

Here is my code:

import scrapy 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

class HnItem(scrapy.Item): 
    title=scrapy.Field() 
    link=scrapy.Field() 

class HnSpider(scrapy.Spider): 
    name="hn" 
    allowed_domains=["https://news.ycombinator.com"] 
    start_urls=["https://news.ycombinator.com/"] 
    def parse(self,response): 
     item=HnItem() 
     item['title'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/text()').extract() 
     item['link'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/@href').extract() 
     print item['title'] 
     print item['link']

But this returns an empty list.

P.S. I am a beginner with Python and Scrapy.

Are you getting any specific errors when you run it, or does it just print empty lists? – Muttonchop

allowed_domains is a collection of domains, not URLs. In this case it should be allowed_domains = ["news.ycombinator.com"]. Not sure whether that is the cause of your problem. – lufte

Answer

Here is what I ended up with when I tried to create a spider:

import scrapy

class HnItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

class HnSpider(scrapy.Spider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']  # see Javier's comment
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        sel = scrapy.Selector(response)
        item = HnItem()
        # These xPaths can probably be made more generic
        item['title'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()
        item['link'] = sel.xpath("//tr[@class='athing']/td[3]/a/@href").extract()
        # Do whatever you want with the item. Print, return, etc.
        print(item['title'])
        print(item['link'])

You can run this from the command line with:

scrapy runspider path/to/your_spider.py
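
On why the XPath in the question returns an empty list: a likely cause is the `tbody` steps. Browser developer tools usually show `<tbody>` elements that the browser inserted itself, while the raw HTML that Scrapy downloads may not contain them, so a path copied from the browser matches nothing. The sketch below shows the effect with the standard library only. The markup is a hypothetical, simplified stand-in for the page, not the real Hacker News HTML, and it uses `xml.etree.ElementTree` rather than Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: note that there is no <tbody> element,
# which is assumed to match what the server actually sends.
RAW_HTML = """
<table id="hnmain">
  <tr class="athing">
    <td>1.</td>
    <td></td>
    <td><a href="https://example.com/story">Example story</a></td>
  </tr>
</table>
"""

root = ET.fromstring(RAW_HTML)  # root is the <table> element

# Path copied from a browser, including the tbody step: matches nothing.
with_tbody = root.findall("./tbody/tr/td[3]/a")

# Same path without tbody, keyed on the row class: matches the story link.
without_tbody = root.findall("./tr[@class='athing']/td[3]/a")

titles = [a.text for a in without_tbody]
links = [a.get("href") for a in without_tbody]

print(with_tbody)
print(titles)
print(links)
```

If the first list is empty and the other two are not, dropping `tbody` from the XPath (as the spider in the answer does) is probably the fix.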

Also, I'd like to add some resources that I highly recommend: [scrapy docs](http://doc.scrapy.org/en/1.0/), [a good example project](https://github.com/scrapy/dirbot), and [W3 Schools xPath intro](http://www.w3schools.com/xpath/) – Muttonchop
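
To illustrate lufte's point that allowed_domains takes domains rather than URLs, here is a minimal sketch that derives the domain list from the start URLs using the standard library. The helper name `domain_of` is made up for this example and is not part of Scrapy.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the bare host name of a URL (no scheme, port, or path)."""
    return urlparse(url).hostname


start_urls = ["https://news.ycombinator.com/"]

# Build allowed_domains from the start URLs instead of typing it by hand.
allowed_domains = sorted({domain_of(u) for u in start_urls})

print(allowed_domains)  # ['news.ycombinator.com']
```

The value in the question, "https://news.ycombinator.com", still contains the scheme, so it is a URL and not a domain.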
