BeautifulSoup with a "Load More" paginated list
Very new here, so apologies in advance. I'm looking to get a list of all the company descriptions from https://angel.co/companies to play around with. The web-based parsing tools I've tried haven't cut it, so I'm looking for a simple Python script. Should I start by getting an array of all the company URLs and then loop through them? Any resources or direction would be helpful; I've been through BeautifulSoup's documentation and some posts/video tutorials, but I keep getting hung up on simulating the JSON requests and so on (see: Get all links with BeautifulSoup from a single page website ('Load More' feature)).
I can see a script that I believe is calling the additional listings:
o.on("company_filter_fetch_page_complete", function(e) {
return t.ajax({
url: "/companies/startups",
data: e,
dataType: "json",
success: function(t) {
return t.html ?
(E().find(".more").empty().replaceWith(t.html),
c()) : void 0
}
})
}),
Thanks!
To scrape data that is loaded dynamically with AJAX, you need to do a fair bit of work to get to the HTML you actually want:
import requests
from bs4 import BeautifulSoup

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

with requests.Session() as s:
    # Initial GET: pick up a valid CSRF token from the page's meta tag.
    r = s.get("https://angel.co/companies").content
    csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
    header["X-CSRF-Token"] = csrf
    # POST to search_data to get the ids of the companies on the first page.
    ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
    # Build the repeated ids[]=... query string plus the remaining params.
    _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
    rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
    url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
    rsp = s.get(url, headers=header)
    print(rsp.json())
We first need to get a valid CSRF token, which is what the initial request does; then we need to POST to https://angel.co/company_filters/search_data.
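As an aside, a small variant of the same flow, sketched here: you can attach the headers to the session once with Session.headers.update so they ride along on every request automatically, instead of passing headers= each time:

import requests
from bs4 import BeautifulSoup

# A sketch of the same two requests, with the headers stored on the session
# itself rather than passed per call.
with requests.Session() as s:
    s.headers.update({
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
    })
    r = s.get("https://angel.co/companies").content
    csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
    s.headers["X-CSRF-Token"] = csrf
    ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}).json()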
That gives us:
{"ids":[296769,297064,60,63,112,119,130,160,167,179,194,236,281,287,312,390,433,469,496,516],"total":908164,"page":1,"sort":"signal","new":false,"hexdigest":"3f4980479bd6dca37e485c80d415e848a57c43ae"}
Those values give us the params we need for our final GET to https://angel.co/companies/startups.
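As a side note, you do not have to build that query string by hand; requests will encode a repeated ids[] parameter for you if you pass it a list. A minimal sketch of that alternative, reusing the ids dict returned by the POST above:

# Sketch: let requests encode the repeated ids[] parameter instead of
# hand-building the query string; `ids` is the dict from the POST above.
params = {"ids[]": ids.pop("ids")}
params.update(ids)  # total, page, sort, new, hexdigest
rsp = s.get("https://angel.co/companies/startups", params=params, headers=header)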
Either way, that request then gives us more JSON holding the HTML with all the company info:
{"html":"<div class=\" dc59 frs86 _a _jm\" data-_tn=\"companies/results ...........
There is far too much of it to post here, but that is what you need to parse.
So, putting it all together:
In [3]: header = {
   ...:     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
   ...:     "X-Requested-With": "XMLHttpRequest",
   ...: }

In [4]: with requests.Session() as s:
   ...:     r = s.get("https://angel.co/companies").content
   ...:     csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
   ...:     header["X-CSRF-Token"] = csrf
   ...:     ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
   ...:     _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
   ...:     rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
   ...:     url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
   ...:     rsp = s.get(url, headers=header)
   ...:     soup = BeautifulSoup(rsp.json()["html"], "lxml")
   ...:     for comp in soup.select("div.base.startup"):
   ...:         text = comp.select_one("div.text")
   ...:         print(text.select_one("div.name").text.strip())
   ...:         print(text.select_one("div.pitch").text.strip())
   ...:
Frontback
Me, now.
Outbound
Optimizely for messages
Adaptly
The Easiest Way to Advertise Across The Social Web.
Draft
Words with Friends for Fantasy (w/ real money)
Graphicly
an automated ebook publishing and distribution platform
Appstores
App Distribution Platform
eVenues
Online Marketplace & Booking Engine for Unique Meeting Spaces
WePow
Video & Mobile Recruitment
DoubleDutch
Event Marketing Automation Software
ecomom
It's all good
BackType
Acquired by Twitter
Stipple
Native advertising for the visual web
Pinterest
A Universal Social Catalog
Socialize
Identify and reward your most influential users with our drop-in social platform.
StyleSeat
Largest and fastest growing marketplace in the $400B beauty and wellness industry
LawPivot
99 Designs for legal
Ostrovok
Leading hotel booking platform for Russian-speakers
Thumb
Leading mobile social network that helps people get instant opinions
AppFog
Making developing applications on the cloud easier than ever before
Artsy
Making all the world’s art accessible to anyone with an Internet connection.
As far as paging goes, you are limited to 20 pages a day, but getting all 20 is simply a matter of adding page: page_no to our form data to get the new params needed for each request, i.e. data={"sort": "signal", "page": page}, which is exactly what you can see being posted when you click Load more.
So, the final code:
import requests
from bs4 import BeautifulSoup


def parse(soup):
    # Pull the company name and pitch out of each result div.
    for comp in soup.select("div.base.startup"):
        text = comp.select_one("div.text")
        yield text.select_one("div.name").text.strip(), text.select_one("div.pitch").text.strip()


def connect(page):
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
    }
    with requests.Session() as s:
        # Grab a CSRF token, then fetch the ids and params for this page.
        r = s.get("https://angel.co/companies").content
        csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
        header["X-CSRF-Token"] = csrf
        ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal", "page": page}, headers=header).json()
        _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
        rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
        url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
        rsp = s.get(url, headers=header)
        soup = BeautifulSoup(rsp.json()["html"], "lxml")
        for n, p in parse(soup):
            yield n, p


for i in range(1, 21):
    for name, pitch in connect(i):
        print(name, pitch)
Obviously the parsing is up to you, but everything you can see in the results in your browser is available.
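For instance, a short sketch that reuses the connect generator above to write everything to a CSV file rather than printing it (the 20-page cap still applies):

import csv

# Sketch: persist the scraped (name, pitch) pairs, reusing connect() from above.
with open("companies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "pitch"])
    for page in range(1, 21):
        for name, pitch in connect(page):
            writer.writerow([name, pitch])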
Padraic, thank you for the thoughtful answer. I'm sure I wouldn't have gotten very far without you, considering how I was attacking this problem. Oddly, I consistently get 383 unique items as output. Any idea why that might be? I believe the page should spit out closer to 900k results. – taylorhamcheese
@TylerHudson-Crimi, if you click Load more than 19 times, you will see *you are limited to 20 pages per query*. –
@TylerHudson-Crimi, what information do you actually need? –
If it helps, the script above is: filters: function(s, l) { var c, u, d, h, p, f, g, m, v, y, b, _, w, x, C, A, k, S, N, T, E, D, I, $, P, M; return u = new o(s(".currently-showing"), l.data("sort")), u.set_data(l.data("init_data")), u.render({ :!l.data("new") }), – taylorhamcheese