Python: Selenium & PhantomJS

Problem description:

I want to scrape the following website: https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0

The text I want to get is:

Showing 114,877 results 

The HTML code:

<div class="jobs-search-results__count-sort pt3"> 
      <div class="jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"> 
       Showing 114,877 results 
      </div> 

My Python code is:

from selenium import webdriver
from bs4 import BeautifulSoup

index_url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'

java = '!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);'
browser = webdriver.PhantomJS()
browser.get(index_url)
browser.execute_script(java)
soup = BeautifulSoup(browser.page_source, "html.parser")
link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"
div = soup.find('div', {"class":link})
text = div.text

So far, my code does not seem to work. I think it has something to do with executing the JavaScript.

I get the following error:


AttributeError       Traceback (most recent call last) 
<ipython-input-33-7cdc1c4e0894> in <module>() 
     6 link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4" 
     7 div = soup.find('div', {"class":link}) 
----> 8 text = div.text 

AttributeError: 'NoneType' object has no attribute 'text' 
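
The traceback means that soup.find returned None, i.e. no matching div was present in the parsed page, so the .text access failed. A minimal sketch of a guard follows; text_or_none is a hypothetical helper name (not part of BeautifulSoup), and the SimpleNamespace object is only a stand-in for a parsed tag so the sketch runs without a browser.

```python
from types import SimpleNamespace


def text_or_none(node):
    """Return the stripped text of a parsed node, or None if the lookup found nothing."""
    if node is None:
        return None
    return node.text.strip()


# Stand-in for a bs4 Tag (assumption: only the .text attribute matters here).
fake_div = SimpleNamespace(text="\n       Showing 114,877 results \n")

print(text_or_none(fake_div))  # text of a node that was found
print(text_or_none(None))      # what soup.find gives when nothing matches
```

With a guard like this, a missing element shows up as None that can be checked, instead of raising AttributeError.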

Output of soup:

<html><head>\n<script type="text/javascript">\nwindow.onload = function() {\n // Parse the tracking code from cookies.\n var trk = "bf";\n var trkInfo = "bf";\n var cookies = document.cookie.split("; ");\n for (var i = 0; i < cookies.length; ++i) {\n if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n  trk = cookies[i].substring(8);\n }\n else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n  trkInfo = cookies[i].substring(8);\n }\n }\n\n if (window.location.protocol == "http:") {\n // If "sl" cookie is set, redirect to https.\n for (var i = 0; i < cookies.length; ++i) {\n  if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n  window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);\n  return;\n  }\n }\n }\n\n // Get the new domain. For international domains such as\n // fr.linkedin.com, we convert it to www.linkedin.com\n var domain = "www.linkedin.com";\n if (domain != location.host) {\n var subdomainIndex = location.host.indexOf(".linkedin");\n if (subdomainIndex != -1) {\n  domain = "www" + location.host.substring(subdomainIndex);\n }\n }\n\n window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +\n  "&originalReferer=" + document.referrer.substr(0, 200) +\n  "&sessionRedirect=" + encodeURIComponent(window.location.href);\n}\n</script>\n</head><body></body></html> 
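
The output above has an empty body and only a script that sends the browser to /authwall, which suggests the session was treated as not logged in and the results page was never rendered. A rough sketch of a check for that case follows; redirects_to_authwall is a hypothetical helper, and matching on the "/authwall?" substring is an assumption based only on the output shown here.

```python
def redirects_to_authwall(page_source: str) -> bool:
    """Heuristic: True if the HTML contains a redirect to the /authwall page."""
    return "/authwall?" in page_source


# Shortened sample modeled on the soup output above (assumption: illustrative only).
authwall_sample = (
    '<html><head><script>window.location.href = "https://" + domain + '
    '"/authwall?trk=" + trk;</script></head><body></body></html>'
)
results_sample = "<html><body><div>Showing 114,877 results</div></body></html>"

print(redirects_to_authwall(authwall_sample))
print(redirects_to_authwall(results_sample))
```

Calling such a check on browser.page_source before parsing would make the failure explicit rather than surfacing later as a NoneType error.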

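Curiously enough, when the page is accessed with the Chrome webdriver, the text is inside the results context: div = soup.find('div', {"class": "results-context"}). With PhantomJS, it may be landing on a modal dialog.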

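My solution uses webdriver.Chrome, because I have never used PhantomJS. There are two cases when you want to get the results text: in one you are logged in to LinkedIn from the driver instance, and in the other you are not logged in.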

Assuming you are not logged in, the following code will do the job:

from selenium import webdriver 
from bs4 import BeautifulSoup 
driver = webdriver.Chrome() 
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0' 
driver.get(url) 
soup = BeautifulSoup(driver.page_source, 'html.parser') 
text = soup.find('div',{'class':'results-context'}).text 
print(text) 

Assuming you are logged in:

from selenium import webdriver 
from bs4 import BeautifulSoup 
driver = webdriver.Chrome() 
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0' 
driver.get(url) 
soup = BeautifulSoup(driver.page_source, 'html.parser') 

# `class` is a reserved word in Python, so the variable needs a different name
class_name = 'jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4'
text = soup.find('div', {'class': class_name}).text.split('\n')[1].lstrip()
print(text)
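
If the goal is the number itself rather than the whole string, the count can be pulled out of the text with a regular expression. This is a minimal sketch that assumes the text always has the form "Showing 114,877 results" as in the question; parse_result_count is a hypothetical helper name.

```python
import re


def parse_result_count(text):
    """Extract the integer from a string like 'Showing 114,877 results', or return None."""
    match = re.search(r"Showing\s+([\d,]+)\s+results", text)
    if match is None:
        return None
    # Drop the thousands separators before converting to int.
    return int(match.group(1).replace(",", ""))


print(parse_result_count("Showing 114,877 results"))
```

Returning None for unexpected text keeps the caller in control when the page layout changes.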