BeautifulSoup and paginated lists with "Load More"

Problem description:

Very new here, so apologies in advance. I am looking to get a list of all the company descriptions from https://angel.co/companies to play around with. The web-based parsing tools I have tried are not cutting it, so I am looking for a simple Python script. Should I start by getting an array of all the company URLs and then loop through them? Any resources or direction would be helpful - I have gone through BeautifulSoup's documentation and some posts/video tutorials, but I keep getting hung up on simulating the JSON request and so on (see: Get all links with BeautifulSoup from a single page website ('Load More' feature)).

I can see a script that I believe is calling the additional listings:

o.on("company_filter_fetch_page_complete", function(e) { 
    return t.ajax({ 
     url: "/companies/startups", 
     data: e, 
     dataType: "json", 
     success: function(t) { 
      return t.html ? 
       (E().find(".more").empty().replaceWith(t.html), 
       c()) : void 0 
     } 
    }) 
}), 

Thanks!

If it helps, above that script is: filters: function(s, l) { var c, u, d, h, p, f, g, m, v, y, b, _, w, x, C, A, k, S, N, T, E, D, I, $, P, M; return u = new o(s(".currently_showing"), l.data("sort")), u.set_data(l.data("init_data")), u.render({ :!l.data("new") }), – taylorhamcheese

To scrape the data that is loaded dynamically with AJAX, you need to do a bit of legwork to get the HTML you actually want:

import requests 
from bs4 import BeautifulSoup 

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

with requests.Session() as s:
    # The initial GET: the page source holds the CSRF token we need for the POST.
    r = s.get("https://angel.co/companies").content
    csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
    header["X-CSRF-Token"] = csrf
    # POST the filter; the response is JSON holding the company ids plus the
    # other params the next request expects.
    ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
    _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
    rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
    url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
    rsp = s.get(url, headers=header)
    print(rsp.json())

We first need to get a valid CSRF token, which is what the initial request is for; then we need to POST to https://angel.co/company_filters/search_data (the request is shown in the screenshot below).
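
A small variation on the snippet above, if you prefer: the headers can live on the Session itself, so every subsequent request carries them automatically (same logic and same imports, just restructured):

s = requests.Session()
s.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
})
r = s.get("https://angel.co/companies")
# The token sits in a <meta name="csrf-token"> tag in the page head.
s.headers["X-CSRF-Token"] = BeautifulSoup(r.content, "lxml").select_one("meta[name=csrf-token]")["content"]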

(screenshot: the POST to /company_filters/search_data in the browser's network tab)

That gives us:

{"ids":[296769,297064,60,63,112,119,130,160,167,179,194,236,281,287,312,390,433,469,496,516],"total":908164,"page":1,"sort":"signal","new":false,"hexdigest":"3f4980479bd6dca37e485c80d415e848a57c43ae"} 

Those are the params we need for our final GET request to https://angel.co/companies/startups:

(screenshot: the final GET to /companies/startups, with the ids and the other params in the query string)
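
As an aside, you don't have to build that query string by hand: if you pass the values through requests' params, a list value for "ids[]" is encoded as the repeated ids%5B%5D=... pairs for you. A sketch equivalent to the manual version above, assuming the s and header objects from the first snippet:

# ids is the JSON dict returned by the search_data POST.
ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
rsp = s.get("https://angel.co/companies/startups",
            params={"ids[]": ids.pop("ids"), **ids},
            headers=header)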

That request then gives us more JSON, holding the HTML and all the company info:

{"html":"<div class=\" dc59 frs86 _a _jm\" data-_tn=\"companies/results ........... 

There is too much to post here, but that is what you need to parse.
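
Since the html value is one long string, it can help to dump it to a file and open it in a browser before writing any selectors; this is a convenience step rather than part of the scrape itself, and the filename is arbitrary:

# Save the embedded HTML fragment so the markup can be inspected.
with open("angel_page.html", "w", encoding="utf-8") as f:
    f.write(rsp.json()["html"])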

So, putting it all together:

In [3]: header = { 
    ...:  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", 
    ...:  "X-Requested-With": "XMLHttpRequest", 
    ...: } 

In [4]: with requests.Session() as s: 
    ...:   r = s.get("https://angel.co/companies").content 
    ...:   csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"] 
    ...:   header["X-CSRF-Token"] = csrf 
    ...:   ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json() 
    ...:   _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")]) 
    ...:   rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()]) 
    ...:   url = "https://angel.co/companies/startups?{}{}".format(_ids, rest) 
    ...:   rsp = s.get(url, headers=header) 
    ...:   soup = BeautifulSoup(rsp.json()["html"], "lxml") 
    ...:   for comp in soup.select("div.base.startup"): 
    ...:     text = comp.select_one("div.text") 
    ...:     print(text.select_one("div.name").text.strip()) 
    ...:     print(text.select_one("div.pitch").text.strip()) 
    ...:   
Frontback 
Me, now. 
Outbound 
Optimizely for messages 
Adaptly 
The Easiest Way to Advertise Across The Social Web. 
Draft 
Words with Friends for Fantasy (w/ real money) 
Graphicly 
an automated ebook publishing and distribution platform 
Appstores 
App Distribution Platform 
eVenues 
Online Marketplace & Booking Engine for Unique Meeting Spaces 
WePow 
Video & Mobile Recruitment 
DoubleDutch 
Event Marketing Automation Software 
ecomom 
It's all good 
BackType 
Acquired by Twitter 
Stipple 
Native advertising for the visual web 
Pinterest 
A Universal Social Catalog 
Socialize 
Identify and reward your most influential users with our drop-in social platform. 
StyleSeat 
Largest and fastest growing marketplace in the $400B beauty and wellness industry 
LawPivot 
99 Designs for legal 
Ostrovok 
Leading hotel booking platform for Russian-speakers 
Thumb 
Leading mobile social network that helps people get instant opinions 
AppFog 
Making developing applications on the cloud easier than ever before 
Artsy 
Making all the world’s art accessible to anyone with an Internet connection. 

As far as paging goes, you are limited to 20 pages per day, but getting all 20 is simply a case of adding the page number to our form data to get the new params needed, i.e. data={"sort": "signal", "page": page}. When you click Load More, you can see what is posted:

(screenshot: the form data posted when clicking Load More, including the page number)

So, the final code:

import requests 
from bs4 import BeautifulSoup 


def parse(soup):
    # Each company sits in a div.base.startup; pull out the name and pitch.
    for comp in soup.select("div.base.startup"):
        text = comp.select_one("div.text")
        yield text.select_one("div.name").text.strip(), text.select_one("div.pitch").text.strip()


def connect(page):
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
    }

    with requests.Session() as s:
        # Grab a fresh CSRF token, POST the filter with the page number,
        # then fetch the HTML fragment for that page.
        r = s.get("https://angel.co/companies").content
        csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
        header["X-CSRF-Token"] = csrf
        ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal", "page": page}, headers=header).json()
        _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
        rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
        url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
        rsp = s.get(url, headers=header)
        soup = BeautifulSoup(rsp.json()["html"], "lxml")
        for n, p in parse(soup):
            yield n, p


for i in range(1, 21):
    for name, pitch in connect(i):
        print(name, pitch)

Obviously what you parse from it is up to you, but everything you can see in the results in your browser is available.
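
Two practical notes if you adapt the final code: results may overlap between pages, and the site rate-limits, so it can be worth de-duplicating by name and pausing between requests. A defensive sketch built on the connect() generator above; the 2-second delay is an arbitrary choice:

import time

seen = set()
for page in range(1, 21):
    for name, pitch in connect(page):
        if name not in seen:  # results may overlap between pages
            seen.add(name)
            print(name, pitch)
    time.sleep(2)  # arbitrary pause between pages, to stay polite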

Padraic, thank you for the thoughtful answer. I am sure I would not have gotten very far without you, considering how I was attacking this problem. Oddly, I consistently get 383 unique items as output. Any idea why that might be? I believe the page should spit out close to 900k results. – taylorhamcheese

@TylerHudson-Crimi, if you click Load More more than 19 times, you will see that *you are limited to 20 pages per query*. –

@TylerHudson-Crimi, what information do you actually need? –