Scraping a web page containing the anchor tag <a href="#"> using Scrapy
Question:
I am scraping Manulife, a web page containing the anchor tag <a href = "#">, using Scrapy.
I want to go to the next page. When I inspected the "Next" link, I got:
<span class="pagerlink">
<a href="#" id="next" title="Go to the next page">Next</a>
</span>
What is the correct approach to follow this link?
# -*- coding: utf-8 -*-
import scrapy
import json
from scrapy_splash import SplashRequest


class Manulife(scrapy.Spider):
    name = 'manulife'
    #allowed_domains = ['https://manulife.taleo.net/careersection/external_global/jobsearch.ftl?lang=en']
    start_urls = ['https://manulife.taleo.net/careersection/external_global/jobsearch.ftl?lang=en&location=1038']

    def start_requests(self):
        # Render each start page through Splash so JavaScript content is loaded.
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 5},
            )

    def parse(self, response):
        #yield {
        #    'demo' : response.css('div.absolute > span > a::text').extract()
        #}
        urls = response.css('div.absolute > span > a::attr(href)').extract()
        for url in urls:
            url = "https://manulife.taleo.net" + url
            yield SplashRequest(url = url, callback = self.parse_details, args={'wait': 5})
            #self.log("reaced22 : "+ url)
        #hitting next button
        #data = json.loads(response.text)
        #self.log("reached 22 : "+ data)
        #next_page_url =
        # Placeholder so the spider runs: the "Next" anchor only exposes href="#",
        # so the real next-page URL is still unknown (this is the open question).
        next_page_url = None
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield SplashRequest(url = next_page_url, callback = self.parse, args={'wait': 5})

    def parse_details(self, response):
        yield {
            'Job post' : response.css('div.contentlinepanel > span.titlepage::text').extract(),
            'Location' : response.xpath("//span[@id = 'requisitionDescriptionInterface.ID1679.row1']/text()").extract(),
            'Organization' : response.xpath("//span[@id = 'requisitionDescriptionInterface.ID1787.row1']/text()").extract(),
            'Date posted' : response.xpath("//span[@id = 'requisitionDescriptionInterface.reqPostingDate.row1']/text()").extract(),
            'Industry': response.xpath("//span[@id = 'requisitionDescriptionInterface.ID1951.row1']/text()").extract()
        }
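As you can see, the code uses SplashRequest and also tries to follow the link to the next page.
I am new to Scrapy, and somewhere I read that the site can also return its response as JSON. I tried that, but it gave me the error "No JSON object could be decoded".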
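For context on that error: "No JSON object could be decoded" typically means the body passed to json.loads was not JSON at all, for example a rendered HTML page. The sketch below uses only the standard library; the helper name try_parse_json and both sample bodies are made up for illustration and are not taken from the site.

```python
import json


def try_parse_json(body):
    """Return the decoded object, or None when body is not valid JSON."""
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# Illustrative bodies (assumptions, not real responses from the site).
html_body = "<html><body><span class='pagerlink'>Next</span></body></html>"
json_body = '{"example": [1, 2, 3]}'

print(try_parse_json(html_body))  # None: an HTML page is not JSON
print(try_parse_json(json_body))
```

Checking the body this way before decoding makes it clear whether the request actually reached a JSON endpoint or just the normal HTML page.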
Answer:
I think the CSS selector ".pagerlink a[title='Go to the next page']"
should work.
But ".pagerlink:last-child a"
would be the best approach. You only need to get the href attribute.
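A quick check of what that href contains, run against the markup quoted in the question. This is a minimal sketch that uses the standard library's ElementTree as a stand-in for Scrapy's selectors (an assumption made only to keep the example dependency-free); the helper names next_href and is_followable are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag, urljoin

# Markup copied from the question.
PAGER_HTML = """<span class="pagerlink">
<a href="#" id="next" title="Go to the next page">Next</a>
</span>"""

PAGE_URL = ("https://manulife.taleo.net/careersection/external_global/"
            "jobsearch.ftl?lang=en&location=1038")


def next_href(snippet):
    """Return the href of the 'Go to the next page' anchor, or None."""
    root = ET.fromstring(snippet)
    link = root.find(".//a[@title='Go to the next page']")
    return None if link is None else link.get("href")


def is_followable(page_url, href):
    """True only if href resolves to a document other than the current page."""
    if not href:
        return False
    target = urldefrag(urljoin(page_url, href)).url
    return target != urldefrag(page_url).url


href = next_href(PAGER_HTML)
print(href)                           # "#"
print(is_followable(PAGE_URL, href))  # False: "#" points back at the same page
```

The selector finds the anchor, but its href resolves to the page already loaded, so a request built from it would not advance the pagination.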
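That just gives the anchor tag containing "#", so it is of no use. :/
I have already tried using scrapy-splash, but without any result.
Scrapy cannot interpret JavaScript; use Selenium for this kind of thing. – shotgunner
I am already using scrapy-splash, which handles JavaScript requests. @shotgunner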
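One direction the thread does not spell out: since the "Next" link has no URL, the click has to happen inside the rendered page. The following is a minimal sketch, assuming Splash 3.0 or later (for splash:select and element:mouse_click), its "execute" endpoint, and that clicking the element with id "next" re-renders the job list in place. None of this has been verified against the Manulife site. The names CLICK_NEXT_LUA, next_page_request_kwargs, and the "clicks" argument are hypothetical.

```python
# Lua script for Splash's "execute" endpoint. Assumption: Splash >= 3.0.
CLICK_NEXT_LUA = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    -- Press the Next link args.clicks times, waiting for each re-render.
    for _ = 1, args.clicks do
        local next_link = splash:select('#next')
        if not next_link then break end
        next_link:mouse_click()
        assert(splash:wait(args.wait))
    end
    return {html = splash:html()}
end
"""


def next_page_request_kwargs(url, clicks, wait=5):
    """Build keyword arguments for a SplashRequest that clicks "Next".

    `clicks` is a hypothetical script argument: how many times to press
    the link before the rendered HTML is returned.
    """
    return {
        "url": url,
        "endpoint": "execute",
        "args": {"lua_source": CLICK_NEXT_LUA, "wait": wait, "clicks": clicks},
        # All result pages share one URL, so do not rely on duplicate filtering.
        "dont_filter": True,
    }


kwargs = next_page_request_kwargs(
    "https://manulife.taleo.net/careersection/external_global/"
    "jobsearch.ftl?lang=en&location=1038",
    clicks=1,
)
print(kwargs["endpoint"], kwargs["args"]["clicks"])
```

Inside the spider this could be used as SplashRequest(callback=self.parse, **next_page_request_kwargs(url, clicks=n)), increasing n for each further page. Replaying the clicks from the first page on every request is slow but keeps each request independent of browser state.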