Python - Check that requests received the complete web page
Problem description:
I use this function in my script to request the BeautifulSoup object for a web page:
import time

import requests
from bs4 import BeautifulSoup

def getSoup(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
    }
    i = 0
    while i == 0:
        # getTime() is a helper defined elsewhere in the script (not shown)
        print('(%s) (INFO) Connecting to: %s ...' % (getTime(), url))
        data = requests.get(url, headers=headers).text
        soup = BeautifulSoup(data, 'lxml')
        if soup is None:
            print("(%s) (WARN) Received 'None' BeautifulSoup object, retrying in 5 seconds ..." % getTime())
            time.sleep(5)
        else:
            i = 1
    return soup
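This loops until I receive a valid BeautifulSoup object, but it occurred to me that I might also receive an incomplete web page and still end up with a valid BeautifulSoup object. I would like to use something like: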
if '</html>' in str(data):
    # the page is completely loaded
    pass
But I don't know whether it is safe to use it this way. Is there a safe way to check that the page was downloaded correctly, using requests or BeautifulSoup?
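The closing-tag idea above can be sketched as a small helper. This is only a heuristic under stated assumptions: the name `has_closing_html_tag` is made up for illustration, the check assumes the server sends a well-formed document that ends with `</html>`, and it can be fooled by a page that embeds the string `</html>` earlier (for example inside a script).

```python
import re

# Matches a closing </html> tag, ignoring case and any whitespace before '>'.
_CLOSING_HTML = re.compile(r'</html\s*>', re.IGNORECASE)


def has_closing_html_tag(markup):
    """Heuristic: return True if the markup contains a closing </html> tag."""
    return _CLOSING_HTML.search(markup) is not None
```

A call such as `has_closing_html_tag(data)` could replace the plain substring test, so that variants like `</HTML>` are also recognised.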
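Answer:
One way is to check the status code of the response and see whether you received a partial-content response (206). A list of the standard HTTP responses and their definitions can be found here.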
partial_data = ''  # text collected from earlier partial responses
response = requests.get(url, headers=headers)
if response.status_code == 200:
    # requests exposes the body as response.text (str) or response.content (bytes)
    soup = BeautifulSoup(partial_data + response.text, 'lxml')
    partial_data = ''
    if soup is None:
        print("(%s) (WARN) Received 'None' BeautifulSoup object, retrying in 5 seconds ..." % getTime())
        time.sleep(5)
elif response.status_code == 206:
    # store partial data here
    partial_data += response.text
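As a complement to the status-code check, a server normally sends 206 only in reply to a request that carries a Range header, so a plain GET that is cut short would usually still report 200. One possible extra check, sketched below, compares the declared Content-Length with the number of bytes actually received. The helper names `body_matches_content_length` and `fetch_text_checked` are hypothetical, and the sketch assumes the server sends a Content-Length header and honours a request for an uncompressed body. When the header is missing or the body is compressed, the check cannot decide and returns None.

```python
def body_matches_content_length(headers, received_byte_count):
    """Compare the declared Content-Length with the bytes received.

    Returns True or False when the comparison is possible, and None
    when it is not (header missing, malformed, or body compressed).
    """
    encoding = headers.get('Content-Encoding', 'identity').lower()
    if encoding not in ('', 'identity'):
        # Content-Length counts compressed bytes, but the caller holds
        # the decoded body, so the two numbers are not comparable.
        return None
    declared = headers.get('Content-Length')
    if declared is None:
        return None
    try:
        return int(declared) == received_byte_count
    except ValueError:
        return None


def fetch_text_checked(url, headers=None, timeout=10):
    """Fetch a page and raise IOError if the body length looks wrong."""
    # Third-party dependency, imported here so the helper above can be
    # used and tested without it.
    import requests

    request_headers = dict(headers or {})
    # Ask for an uncompressed body so the lengths can be compared.
    request_headers.setdefault('Accept-Encoding', 'identity')
    response = requests.get(url, headers=request_headers, timeout=timeout)
    response.raise_for_status()
    if body_matches_content_length(response.headers, len(response.content)) is False:
        raise IOError('received body does not match Content-Length')
    return response.text
```

A caller could wrap `fetch_text_checked(url, headers)` in the retry loop from the question and sleep before retrying whenever it raises.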