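BeautifulSoup: pulling the tag that precedes another tag
Question:
I'm pulling lists from web pages, and to give them context I'm also pulling the text that comes right before them. Pulling the tag that precedes the <ul> or <ol> tag seems like the best way to do that. So let's say I have a list like this:
I'd want to pull the bullet and the word "Millennials". I'm using this BeautifulSoup function: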
# Pull <ul> tags that contain an <li>, carry no attributes, and contain no links
def pull_ul(tag):
    return tag.name == 'ul' and tag.li and not tag.attrs and not tag.li.attrs and not tag.a

# webpage is the parsed BeautifulSoup object for the page
ul_tags = webpage.find_all(pull_ul)
# Find the text immediately preceding each <ul> tag and prepend it to the <ul> tag
ul_with_context = [str(ul.previous_sibling) + str(ul) for ul in ul_tags]
When I print ul_with_context, I get the following:
['\n<ul>\n<li>With immigration adding more numbers to its group than any other, the Millennial population is projected to peak in 2036 at 81.1 million. Thereafter the oldest Millennial will be at least 56 years of age and mortality is projected to outweigh net immigration. By 2050 there will be a projected 79.2 million Millennials.</li>\n</ul>']
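As you can see, "Millennials" wasn't pulled. The page I'm pulling from is http://www.pewresearch.org/fact-tank/2016/04/25/millennials-overtake-baby-boomers/ and this is the markup for the bulleted section:
The <p> and <ul> tags are siblings. Any idea why it isn't pulling the tag that contains the word "Millennials"?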
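Answer
previous_sibling returns whatever element or string comes immediately before the tag. In your case that is the string '\n': the line break between the closing </p> and the opening <ul> in the page source is a text node of its own, and it sits between the two tags.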
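To make that concrete, here is a minimal sketch. The markup is a made-up stand-in for the page in the question, and previous_tag_sibling is a hypothetical helper name, not part of BeautifulSoup. It shows that previous_sibling lands on the whitespace string, and one way to step over such strings using the previous_siblings generator:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup modeled on the question: a <p> followed by a <ul>,
# with a line break between them as in most real page sources.
html = "<div><p>Millennials</p>\n<ul>\n<li>first bullet</li>\n</ul></div>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

# The immediate previous sibling is the whitespace text node, not the <p>.
prev = ul.previous_sibling
print(type(prev).__name__, repr(str(prev)))


def previous_tag_sibling(node):
    """Walk backwards over the siblings and return the first one that is a Tag."""
    for sibling in node.previous_siblings:
        if isinstance(sibling, Tag):
            return sibling
    return None  # no tag precedes this node at the same level


context = previous_tag_sibling(ul)
print(context)
```

Under these assumptions the first print reports a NavigableString holding '\n', and the second prints the <p> element.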
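Instead, you can use the findPrevious method to get the node that precedes the one you selected:
from bs4 import BeautifulSoup

doc = """
<h2>test</h2>
<ul>
<li>1</li>
<li>2</li>
</ul>
"""
soup = BeautifulSoup(doc, 'html.parser')
tags = soup.find_all('ul')
# For each <ul>, look up the tag that comes before it in the document
print([ul.findPrevious() for ul in tags])
print(tags)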
This outputs:
[<h2>test</h2>]
[<ul>
<li>1</li>
<li>2</li>
</ul>]
In the current version of BeautifulSoup that I'm using, the method is find_previous() rather than findPrevious().
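One caveat worth sketching: find_previous searches backwards in document order, so if the preceding paragraph contains nested tags it can return one of those instead of the paragraph itself. Since the question says the <p> and <ul> are siblings, find_previous_sibling may be the closer fit. The markup below is invented for illustration, and with_context is a hypothetical helper that assumes the context always lives in a <p>:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the paragraph before the list contains a nested <a>.
html = (
    '<div><p>Millennials <a href="#">source</a></p>\n'
    "<ul>\n<li>first bullet</li>\n</ul></div>"
)
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

# find_previous walks backwards through everything parsed before the <ul>,
# so it can land on a tag nested inside the preceding paragraph.
nearest_tag = ul.find_previous(True)  # True matches any tag
# find_previous_sibling only considers nodes at the same level as the <ul>.
sibling_p = ul.find_previous_sibling("p")


def with_context(ul_tag):
    """Prepend the preceding sibling <p>, if there is one, to the <ul> markup."""
    prev = ul_tag.find_previous_sibling("p")
    prefix = str(prev) if prev is not None else ""
    return prefix + str(ul_tag)


ul_with_context = [with_context(u) for u in soup.find_all("ul")]
print(nearest_tag.name, sibling_p.name)
print(ul_with_context[0])
```

Here nearest_tag is the nested <a>, while sibling_p is the whole paragraph, which is what the list comprehension in the question was after. If the context can be something other than a <p> (a heading, say), passing True instead of "p" to find_previous_sibling would accept any tag.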