How to select text without the HTML markup
Question:
I am working on a web scraper (using Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks like this:
<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>
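I would like to extract the text from this class. Right now I can use something along the lines of
//p[@class='something']//text()
but this results in each chunk of text ending up as a separate result element, like this: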
(This class has some ,text, and a few ,links, in it.)
The desired output would contain all the text in a single element, like this:
This class has some text and a few links in it.
Is there a simple or elegant way to achieve this?
Edit: Here is the code that produces the result given above.
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']//text()"
tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))
Answer
You can call .text_content() on the lxml element instead of fetching the text nodes with XPath.
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']"
tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))
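One caveat worth illustrating: text_content() is provided by elements parsed through lxml.html, and elements parsed through lxml.etree do not have it. The sketch below assumes the snippet is well-formed enough to parse as XML, and shows two equivalents that work on plain etree elements: itertext() and the XPath string() function. The variable names are only for illustration.

```python
from lxml import etree

snippet = (
    '<p class="something">This class has some <strong>text</strong> '
    'and a few <a href="http://www.example.com">links</a> in it.</p>'
)

# Parsed with lxml.etree, so the root <p> element has no text_content().
root = etree.fromstring(snippet)

# itertext() yields every text node in document order; join them back together.
via_itertext = "".join(root.itertext())

# string() returns the concatenated text of the context node and its descendants.
via_xpath = root.xpath("string()")

print(via_itertext)
print(via_xpath)
```

Both approaches return the text of the whole element as one string, whitespace untouched.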
Answer
You can use normalize-space() in the XPath expression. Then
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"
tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))
will yield
This class has some text and a few links in it.
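Two details of normalize-space() may matter in a real scraper. It trims leading and trailing whitespace and collapses internal runs into single spaces. When given a node-set, XPath 1.0 converts only the first node to a string, so later matches are silently ignored. The sketch below uses a made-up two-paragraph snippet (not from the question) to show both effects, and evaluates normalize-space() per element to get one string per match.

```python
from lxml import html

# Hypothetical input with messy whitespace and two matching paragraphs.
snippet = (
    '<div>'
    '<p class="something">  First   <strong>paragraph</strong>\n text. </p>'
    '<p class="something">Second paragraph.</p>'
    '</div>'
)
tree = html.fromstring(snippet)

# A node-set argument is reduced to its first node before normalization.
first_only = tree.xpath("normalize-space(//p[@class='something'])")
print(first_only)

# Evaluate normalize-space() relative to each matched element instead.
per_paragraph = [
    p.xpath("normalize-space()")
    for p in tree.xpath("//p[@class='something']")
]
print(per_paragraph)
```

If the original whitespace needs to be preserved, text_content() is the safer choice, since normalize-space() rewrites it.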
Answer
An alternative one-liner for your original code: use join with an empty-string separator:
print("".join(query_results))
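As a self-contained illustration, the list below stands in for query_results. Its contents are taken from the output shown in the question, and the name chunks is only for this sketch. An empty separator is the right choice because the text nodes already carry their own surrounding spaces. Joining with " " would double them.

```python
# Stand-in for query_results: the text nodes returned by the //text() query.
chunks = ["This class has some ", "text", " and a few ", "links", " in it."]

# Concatenate the pieces exactly as they appear in the document.
joined = "".join(chunks)
print(joined)

# Optional: collapse any whitespace runs, similar in spirit to normalize-space().
normalized = " ".join(joined.split())
print(normalized)
```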
What HTML parsing library are you using? – alecxe 2015-04-01 19:03:34
I am using lxml; I have updated the question. – Yuka 2015-04-01 19:10:38