BeautifulSoup while statement
I'm trying to find the first thirty TED videos (the video's name and URL) with the following BeautifulSoup script:
import urllib2
from BeautifulSoup import BeautifulSoup

total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
    soup = BeautifulSoup(page)
    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail'))
    outfile = open("test.html", "w")
    print >> outfile, """<head>
        <head>
            <title>TED Talks Index</title>
        </head>
        <body>
        <br><br><center>
        <table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>"""
    print >> outfile, "<tr><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #E16543;'>URL</th></tr>"
    ted_link = 'http://www.ted.com/'
    for anchor in link:
        print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href'])
        count = count + 1
    print >> outfile, """</table>
        </body>
        </html>"""
    page_count = page_count + 1
The code seems fine minus two things:

1. The count doesn't seem to increment. It only goes through and finds the content on the first page, i.e. the first ten videos, not all thirty. Why?
2. This code gives me a bunch of errors. I don't know how to implement the logic I want here (with urlopen("%s%d")):

Code:
total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
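The errors come from operator placement: "%s%d" is passed to urlopen unformatted, and the % operator is then applied to the object urlopen returns, which raises a TypeError. A minimal sketch of the intended formatting, with the network call left out:

```python
url = 'http://www.ted.com/talks?page='
page_count = 1

# Format the string first, then pass the finished URL to urlopen:
full_url = "%s%d" % (url, page_count)
# full_url is now 'http://www.ted.com/talks?page=1'
```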
First, simplify the loop and eliminate a couple of variables that in this case amount to boilerplate cruft:
for pagenum in xrange(1, 4):  # The 4 is annoying, write it as 3+1 if you like.
    url = "http://www.ted.com/talks?page=%d" % pagenum
    # do stuff with url
But let's open the file outside of the loop rather than reopening it on every iteration. (That's why you only saw 10 results: they were talks 11-20, not the first ten as you probably thought. It should have been 21-30, except that you loop on page_count < total_pages and therefore only process the first two pages.)
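That off-by-one can be seen with plain counters standing in for the network calls (a sketch, not the original script):

```python
total_pages = 3
page_count = 1
pages_fetched = []

# The loop condition from the question: stops before page 3 is requested.
while page_count < total_pages:
    pages_fetched.append(page_count)
    page_count = page_count + 1

# pages_fetched == [1, 2]: only two pages are processed, and because
# test.html is reopened (and truncated) on every iteration, only the
# second page's talks (11-20) survive in the output.
```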
And collect all the links first, then write the output at the end. I've also stripped out the HTML styling, which makes the code easier to follow; use CSS instead, perhaps in an inline <style> element, or add it back if you like.
import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")

links = []
for pagenum in xrange(1, 4):
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
    links.extend(soup.findAll(is_talk_anchor))

out = open("test.html", "w")

print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th></tr>"""

for x, a in enumerate(links):
    print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"]))

print >>out, "</table>"

# Or, as an ordered list:
print >>out, "<ol>"
for a in links:
    print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"]))
print >>out, "</ol>"

print >>out, "</body></html>"
Thanks! Could you explain "from cgi import escape"? – EGP 2011-04-29 06:48:58
@AdamC.: In case one of the URLs or titles contains a character that is special to HTML, i.e. &, … 2011-04-29 06:52:47
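For anyone wondering what escape actually does: it replaces HTML-special characters with entity references, so a title containing &, < or quotes can't break the generated markup. A small sketch using Python 3's html.escape (the modern replacement for cgi.escape, which was deprecated and later removed from the standard library):

```python
from html import escape  # Python 2 code above uses cgi.escape instead

title = 'Beauty & truth: <lessons> from "TED"'
safe = escape(title, quote=True)
# safe == 'Beauty &amp; truth: &lt;lessons&gt; from &quot;TED&quot;'
```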
It won't solve your problem, but you have two opening '<head>' tags instead of '<html>' and '<head>' tags (i.e. 'print >> outfile, """<head>' should be 'print >> outfile, """<html>'). – 2011-04-29 06:08:12