Getting specific data after crawling a website with Python

Question:

This is my first Python project, written by following YouTube videos. I'm not very proficient, but I think I have the basics of coding.

# import the module that lets us fetch pages over HTTP
import requests

# BeautifulSoup parses the HTML of the pages we crawl
from bs4 import BeautifulSoup


# loop over the sitemap pages, changing the URL on every iteration
def creator_spider(max_pages):
    page = 0
    while page < max_pages:
        url = 'https://www.patreon.com/sitemap/campaigns/' + str(page)
        source_code = requests.get(url)

        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")

        for link in soup.findAll('a', {'class': ''}):
            href = "https://www.patreon.com" + link.get('href')
            #title = link.string
            print(href)
            #print(title)
            get_single_item_data(href)
        page = page + 1


def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text

    soup = BeautifulSoup(plain_text, "html.parser")
    print(soup)
    for item_name in soup.findAll('h6'):
        print(item_name.string)

From every page I crawl, I want the code to get this highlighted information: http://imgur.com/a/e59S9 whose source code is: http://imgur.com/a/8qv7k

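My guess is that I should change the arguments of soup.findAll() in the get_single_item_data() function, but all my attempts have been in vain. Any help with this is much appreciated.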

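This is a JavaScript-driven site, so it cannot be retrieved this way. You need to simulate a real browser to scrape these pages. You can try Selenium or PhantomJS. – sailesh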

Answer

From the BS4 documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class

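It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_: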

soup.find_all("a", class_="sister") 
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
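
To try the class_ filter in isolation, here is a minimal sketch that parses a small hard-coded snippet instead of a live page. The markup is made up for illustration (modeled on the documentation example), and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration, modeled on the Beautiful Soup docs example
html = """
<p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  <a class="other" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (trailing underscore) avoids the clash with Python's reserved word "class"
sister_links = soup.find_all("a", class_="sister")
sister_hrefs = [a.get("href") for a in sister_links]

# The dictionary form used in the question selects the same tags
same_links = soup.find_all("a", {"class": "sister"})

print(sister_hrefs)
```

Both calls select only the two anchors whose class is "sister"; the third anchor is skipped.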

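After a closer look, though, this approach will not get you the markup shown in your picture. In the source code I see data-react-id. The DOM is built by ReactJS, and requests.get(url) does not execute JS on your end. Disable JS in your browser to see what requests.get(url) actually returns.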
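
The effect can be illustrated without touching the network. The sketch below parses a made-up HTML string shaped like what a server might send for a client-rendered page: an empty mount point plus a script tag. The markup, the id "root", and the script path are assumptions for illustration, not Patreon's actual source.

```python
from bs4 import BeautifulSoup

# Hypothetical server response for a client-rendered page:
# the content only appears after a browser executes the script.
static_html = """
<html>
  <body>
    <div id="root"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""

soup = BeautifulSoup(static_html, "html.parser")

# No <h6> exists in the static markup, so the search comes back empty
headings = soup.find_all("h6")
print(headings)

# The mount point is present but has no rendered text inside it
root = soup.find("div", id="root")
print(root.get_text(strip=True) == "")
```

If the same search comes back empty for the real page, the data is probably being added by JavaScript, and a browser-driving tool such as the ones suggested in the comment on the question is a likely next step.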

Best regards
