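How do I extract an attribute from an XML tag using Beautiful Soup?
I want to use Beautiful Soup in a Django project to extract data from XML tags. Here is a sample of the markup I'm working with: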
<item>
<title>
Title goes here
</title>
<link>
Link1 goes here
</link>
<description>
Description goes here
</description>
<media:thumbnail url="Image URL goes here" height="222" width="300"/>
<pubDate>Thu, 15 Sep 2016 13:24:48 EDT</pubDate>
<guid isPermaLink="true">
Link2 goes here
</guid>
</item>
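I can already get the strings for the title, link and description tags, but I can't get the URL from the media:thumbnail tag.
Here is the snippet I use to get the values of the other tags: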
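from urllib.request import urlopen
from bs4 import BeautifulSoup

# xmllink (the feed URL) and the three lists are defined elsewhere in the project
soup = BeautifulSoup(urlopen(xmllink), 'xml')
for items in soup.find_all('item'):
    listTitle.append(items.title.get_text())
    listURL.append(items.link.get_text())
    listDescription.append(items.description.get_text())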
Please help.
The problem is that not every item has a media:thumbnail, so you need to check for it first:
In [60]: import requests
In [61]: from bs4 import BeautifulSoup
In [62]: soup = BeautifulSoup(requests.get("https://rss.sciencedaily.com/computers_math/computer_programming.xml").content, "xml")
In [63]: for item in soup.find_all("item"):
   ....:     thumb = item.find("thumbnail")
   ....:     if thumb:
   ....:         print(thumb["url"])
   ....:
https://images.sciencedaily.com/2016/09/160915132448.jpg
https://images.sciencedaily.com/2016/09/160915090018.jpg
https://images.sciencedaily.com/2016/09/160914090327.jpg
https://images.sciencedaily.com/2016/09/160913134149.jpg
https://images.sciencedaily.com/2016/09/160909094844.jpg
https://images.sciencedaily.com/2016/09/160907125004.jpg
https://images.sciencedaily.com/2016/09/160906085157.jpg
https://images.sciencedaily.com/2016/08/160831085055.jpg
https://images.sciencedaily.com/2016/08/160822181811.jpg
https://images.sciencedaily.com/2016/08/160815134941.jpg
https://images.sciencedaily.com/2016/08/160815134817.jpg
https://images.sciencedaily.com/2016/08/160809095640.jpg
https://images.sciencedaily.com/2016/08/160803140137.jpg
https://images.sciencedaily.com/2016/07/160722104135.jpg
https://images.sciencedaily.com/2016/07/160721144139.jpg
https://images.sciencedaily.com/2016/07/160721103855.jpg
https://images.sciencedaily.com/2016/07/160720094641.jpg
https://images.sciencedaily.com/2016/07/160718133206.jpg
https://images.sciencedaily.com/2016/07/160713105850.jpg
https://images.sciencedaily.com/2016/07/160711151055.jpg
https://images.sciencedaily.com/2016/07/160707083258.jpg
https://images.sciencedaily.com/2016/06/160629125823.jpg
https://images.sciencedaily.com/2016/06/160627125140.jpg
https://images.sciencedaily.com/2016/06/160624101050.jpg
https://images.sciencedaily.com/2016/06/160622104810.jpg
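Applied to the lists from the question, the same check might look like the sketch below. The inline two-item feed, the example.com URLs and the choice to store None for a missing thumbnail are assumptions made for illustration; listImage is the list name used in the comments further down. The 'xml' parser requires lxml to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical feed: only the first item carries a media:thumbnail.
SAMPLE = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<item>
<title>First</title>
<media:thumbnail url="https://example.com/a.jpg" height="222" width="300"/>
</item>
<item>
<title>Second</title>
</item>
</channel>
</rss>"""

soup = BeautifulSoup(SAMPLE, "xml")

listTitle, listImage = [], []
for item in soup.find_all("item"):
    listTitle.append(item.title.get_text(strip=True))
    # find() returns None when the item has no thumbnail, so guard
    # before subscripting to avoid the TypeError on NoneType.
    thumb = item.find("thumbnail")
    listImage.append(thumb["url"] if thumb is not None else None)

print(listImage)
```

Appending None keeps listImage the same length as listTitle, so the two lists stay aligned by index when they are passed to a template.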
A faster alternative would be to use lxml:
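import requests
from lxml import etree

# Assumed setup: parse the same feed as above; tree.nsmap maps the "media" prefix.
tree = etree.fromstring(requests.get("https://rss.sciencedaily.com/computers_math/computer_programming.xml").content)
for item in tree.findall(".//item/media:thumbnail", tree.nsmap):
    parent = item.getparent()  # the enclosing <item>
    print(parent.xpath("title/text()")[0])
    print(parent.xpath("link/text()")[0])
    print(item.get("url"))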
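If installing lxml is not an option, a similar namespace-qualified lookup can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not part of the answer above. The inline feed and the example.com URLs are made up for illustration; the namespace URI is the one Media RSS feeds normally declare for the media prefix, and it is assumed here rather than read from the feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the second item has no media:thumbnail.
SAMPLE = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>First</title>
      <link>https://example.com/first</link>
      <media:thumbnail url="https://example.com/a.jpg" height="222" width="300"/>
    </item>
    <item>
      <title>Second</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI map used to resolve "media:" in the search path (assumed URI).
NS = {"media": "http://search.yahoo.com/mrss/"}

root = ET.fromstring(SAMPLE)
rows = []
for item in root.iter("item"):
    thumb = item.find("media:thumbnail", NS)
    # Compare against None explicitly: an Element without children is
    # falsy in ElementTree, so "if thumb:" would skip valid thumbnails.
    url = thumb.get("url") if thumb is not None else None
    rows.append((item.findtext("title"), item.findtext("link"), url))

for row in rows:
    print(row)
```

Unlike the lxml loop, this iterates over every item and records None where the thumbnail is missing, so no item is silently dropped.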
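Thanks a lot! But in my Django project, listImage.append(items.find("media:thumbnail")["url"]) throws TypeError: 'NoneType' object is not subscriptable. Without this line, everything else works perfectly.
I am using the XML parser.
It still throws the same error, mate.
Show us what you have tried so far.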