Extracting data from a website using Python

Question:

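I recently started learning Python, and one of my first projects was to scrape updates from my son's classroom web page and send myself a notification whenever the site was updated. That turned out to be an easy project, so I wanted to expand on it and create a script that automatically checks whether any of our lotto numbers hit. Unfortunately, I haven't been able to figure out how to get the data from the website. Here is one of my attempts from last night.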

from bs4 import BeautifulSoup
import urllib.request

webpage = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"

# Download the page's static HTML and parse it
websource = urllib.request.urlopen(webpage)
soup = BeautifulSoup(websource.read(), "html.parser")

# Look up the span with id "winning_num_0"
span = soup.find("span", {"id": "winning_num_0"})
print(span)

Output:
<span id="winning_num_0"></span>

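The output listed above is also what I see if I "view source" in a web browser. When I "inspect element" in the browser, I can see the winning numbers in the inspector panel. Unfortunately, I'm not even sure how or where the browser is getting the data. Is it loaded from another page, or by a script in the background? I thought the following tutorial would help me, but I wasn't able to get the data using similar commands.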

http://zevross.com/blog/2014/05/16/using-the-python-library-beautifulsoup-to-extract-data-from-a-webpage-applied-to-world-cup-rankings/

Any help is appreciated. Thanks.

If the content is dynamic, you will probably need an approach based on, for example, Selenium - http://selenium-python.readthedocs.io/api.html – ewcz

Possible duplicate of [Reading dynamically generated web pages using python](http://stackoverflow.com/questions/13960567/reading-dynamically-generated-web-pages-using-python) – Sandeep

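Checking what the page does in the developer console, it loads the data dynamically from here: http://www.masslottery.com/data/json/games/lottery/recent.json So you could write a script that loads that JSON-formatted data and checks the numbers from there. Much easier than scraping HTML :) – lari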

Answer

If you look closely at the page's source (I just used curl), you can see this block:

<script type="text/javascript"> 
    // <![CDATA[ 
    var dataPath = '../../'; 
    var json_filename = 'data/json/games/lottery/recent.json'; 
    var games = new Array(); 
    var sessions = new Array(); 
    // ]]> 
</script>

That recent.json stuck out like a sore thumb (I actually missed the dataPath part at first).
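How those two variables turn into a full URL can be sketched as follows. This is only an illustration: it assumes the page's JavaScript simply concatenates dataPath and json_filename and that the result is resolved relative to the page's own URL. The helper names js_string_var and resolve_json_url are made up for this example and are not part of the site or any library.

```python
import re
from urllib.parse import urljoin

# URL of the page from the question
PAGE_URL = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"

# The two relevant lines of the script block quoted above
SCRIPT_BLOCK = """
var dataPath = '../../';
var json_filename = 'data/json/games/lottery/recent.json';
"""


def js_string_var(source, name):
    """Return the single-quoted string assigned to `var <name>` in source, or None.

    Hypothetical helper: a plain regex match, not a JavaScript parser.
    """
    pattern = r"var\s+" + re.escape(name) + r"\s*=\s*'([^']*)'"
    match = re.search(pattern, source)
    return match.group(1) if match else None


def resolve_json_url(page_url, source):
    """Join dataPath and json_filename, then resolve against the page URL.

    Assumption: the page concatenates the two variables in this order.
    """
    data_path = js_string_var(source, "dataPath") or ""
    filename = js_string_var(source, "json_filename")
    if filename is None:
        raise ValueError("json_filename not found in page source")
    return urljoin(page_url, data_path + filename)


json_url = resolve_json_url(PAGE_URL, SCRIPT_BLOCK)
print(json_url)
```

The two "../" segments climb out of /games/lottery/, which is why the data file sits at the site root rather than next to the HTML page.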

After giving it a try, I came up with this:

curl http://www.masslottery.com/data/json/games/lottery/recent.json

Which, as lari pointed out in the comments, is way easier than scraping HTML. It's this easy, in fact:

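import json
import urllib.request
from pprint import pprint

# JSON file that the page loads in the background
url = 'http://www.masslottery.com/data/json/games/lottery/recent.json'

# Fetch the file, decode the response bytes, and parse the JSON text
with urllib.request.urlopen(url) as websource:
    data = json.loads(websource.read().decode())
pprint(data)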

data is now a dictionary, and you can do whatever dictionary-like things you want with it. Good luck ;)
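As a rough sketch of what those "dictionary-like things" might look like, the snippet below lists every value in a nested structure together with its path, which can help locate where the winning numbers live, and then counts how many picks match. The layout of recent.json is not shown here, so the sample dictionary, its keys, and the helper names iter_leaves and count_matches are all invented for illustration. Substitute the real data and the real keys once you have inspected them.

```python
def iter_leaves(node, path=""):
    """Yield (path, value) for every non-container value in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            yield from iter_leaves(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, f"{path}[{index}]")
    else:
        yield path, node


def count_matches(my_numbers, winning_numbers):
    """Return how many of my_numbers appear in winning_numbers."""
    return len(set(my_numbers) & set(winning_numbers))


# Made-up stand-in for `data`; the real feed's structure may differ entirely.
sample = {"games": [{"name": "Example Game", "numbers": [3, 11, 19, 27, 42]}]}

# Print each leaf with its path to see where the numbers are stored
for path, value in iter_leaves(sample):
    print(path, "=", value)

# Hypothetical picks, compared against the sample's numbers
my_picks = [3, 7, 19, 33, 42]
matches = count_matches(my_picks, sample["games"][0]["numbers"])
print("matches:", matches)
```

Using sets for the comparison ignores the order of the numbers, which suits games where order does not matter; a game where position counts would need a different check.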

Thanks. I'll try this tonight! – gameoverman

For added fun, you could always use Python's random module to guess the lotto numbers and see how much money it would make you. –

Haha. It can't do any worse than the office lotto pool... – gameoverman
