Unable to scrape news titles from Hacker News

Problem description:

I just want to scrape the titles and links of the latest news stories from Hacker News.

Here is my code:

import scrapy
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class HnItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

class HnSpider(scrapy.Spider):
    name = "hn"
    allowed_domains = ["https://news.ycombinator.com"]
    start_urls = ["https://news.ycombinator.com/"]

    def parse(self, response):
        item = HnItem()
        item['title'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/text()').extract()
        item['link'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/@href').extract()
        print item['title']
        print item['link']

But this returns an empty list.

P.S. I am a beginner with Python and Scrapy.

+0

Are you getting any specific errors when you run it, or is it just printing empty lists? – Muttonchop

+1

allowed_domains is a collection of domains, not URLs. In this case it should be allowed_domains = ["news.ycombinator.com"]. Not sure whether that is the cause of your problem. – lufte

Here is what I ended up with when I tried creating a spider:

import scrapy

class HnItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

class HnSpider(scrapy.Spider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']  # see Javier's comment
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        sel = scrapy.Selector(response)
        item = HnItem()
        # These XPaths can probably be made more generic
        item['title'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()
        item['link'] = sel.xpath("//tr[@class='athing']/td[3]/a/@href").extract()
        # Do whatever you want with the item: print, return, etc.
        print(item['title'])
        print(item['link'])

You can run this from the command line with: scrapy runspider path/to/your_spider.py
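
One possible reason the XPath in the question matches nothing is its `tbody` steps. Browsers add `<tbody>` to tables when they build the DOM, so a path copied from developer tools can contain it even when the raw HTML that Scrapy downloads does not. I have not checked the actual Hacker News markup for this, so treat it as an assumption. The sketch below uses made-up HTML and the standard library's `xml.etree.ElementTree` (not Scrapy) purely to show the effect; the `titles` helper and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for raw page markup: note there is no <tbody>.
RAW_HTML = """
<table id="hnmain">
  <tr class="athing">
    <td>1.</td>
    <td></td>
    <td><a href="https://example.com/a">First story</a></td>
  </tr>
  <tr class="athing">
    <td>2.</td>
    <td></td>
    <td><a href="https://example.com/b">Second story</a></td>
  </tr>
</table>
""".strip()

root = ET.fromstring(RAW_HTML)


def titles(path):
    """Return the link texts matched by an ElementTree path expression."""
    return [a.text for a in root.findall(path)]


# Path in the style copied from browser dev tools: expects a <tbody> child.
with_tbody = titles("./tbody/tr/td[3]/a")

# Path that does not assume <tbody>, in the spirit of the spider above.
without_tbody = titles(".//tr[@class='athing']/td[3]/a")

print(with_tbody)
print(without_tbody)
```

If the same thing happens with the real page, dropping `tbody` from the path (or anchoring on a class, as the spider does) is usually enough.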

+0

Additionally, I'd like to add some resources I highly recommend: the [Scrapy docs](http://doc.scrapy.org/en/1.0/), [a good example project](https://github.com/scrapy/dirbot), and the [W3Schools XPath intro](http://www.w3schools.com/xpath/). – Muttonchop
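
As a footnote to lufte's comment about allowed_domains: if you would rather not type the domain by hand, here is a small sketch (Python 3, standard library only) that derives it from the start URLs. The `to_domain` helper is a hypothetical name of mine, not part of Scrapy.

```python
from urllib.parse import urlparse


def to_domain(url):
    """Return just the host name of a URL (no scheme, port, or path)."""
    return urlparse(url).hostname


start_urls = ["https://news.ycombinator.com/"]

# A set removes duplicates when several start URLs share a host.
allowed_domains = sorted({to_domain(u) for u in start_urls})

print(allowed_domains)
```

The resulting list has the bare-domain form that the comment describes, so it could be assigned to the spider's allowed_domains attribute.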