How to select text without the HTML markup

Question:

I'm working on a web scraper (in Python), so I have a big chunk of HTML that I'm trying to extract text from. One of the snippets looks like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>

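I want to extract the text from that class. Right now I can use something along the lines of

//p[@class='something']//text()

but this makes each chunk of text end up as a separate result element, like this: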

(This class has some ,text, and a few ,links, in it.)

The desired output would contain all of the text in a single element, like this:

This class has some text and a few links in it.

Is there a simple or elegant way to achieve this?

Edit: here is the code that produces the result given above.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

# Select every text node under the matching paragraph
xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    # Each text node is printed as a separate result
    print("'{0}'".format(item))

Which HTML parsing library are you using? – alecxe 2015-04-01 19:03:34

I'm using lxml; I've updated the question. – Yuka 2015-04-01 19:10:38

Answer

Instead of fetching the text with XPath, you can call .text_content() on the lxml element.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

# Select the element itself rather than its text nodes
xpath_query = "//p[@class='something']"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    # text_content() returns the element's text, including its children, without markup
    print("'{0}'".format(item.text_content()))
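As a rough illustration of one thing to watch for with this approach: text_content() keeps whitespace as it appears in the source, so line breaks and indentation inside the element survive. The sketch below uses a hypothetical snippet (not the one from the question) and collapses the whitespace with plain str.split() and join, which is only one possible way to do it.

```python
from lxml import html

# Hypothetical snippet with a line break and indentation inside the paragraph
html_snippet = (
    '<p class="something">This class has some\n'
    '    <strong>text</strong> in it.</p>'
)

tree = html.fromstring(html_snippet)
paragraph = tree.xpath("//p[@class='something']")[0]

# text_content() strips the markup but keeps the original whitespace
raw = paragraph.text_content()

# split() with no arguments splits on any run of whitespace,
# so joining with a single space collapses the runs
collapsed = " ".join(raw.split())

print(repr(raw))
print(collapsed)
```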

Answer

You can use normalize-space() in XPath. Then

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
# normalize-space() returns the paragraph's text as one whitespace-normalized string
xpath_query = "normalize-space(//p[@class='something'])"

tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))

will produce

This class has some text and a few links in it.
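One caveat that may matter on a real page: in XPath 1.0, converting a node set to a string uses only the first node in document order, so normalize-space(//p[...]) yields the text of the first matching paragraph only. The sketch below, built on a hypothetical two-paragraph snippet, evaluates normalize-space(.) relative to each matched element instead.

```python
from lxml import html

# Hypothetical snippet with two matching paragraphs
html_snippet = (
    '<div>'
    '<p class="something">First <strong>block</strong> of text.</p>'
    '<p class="something">Second <a href="http://www.example.com">block</a>.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Applied to the whole node set, only the first paragraph is used
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Evaluating normalize-space(.) per element gives one string per paragraph
per_paragraph = [
    p.xpath("normalize-space(.)")
    for p in tree.xpath("//p[@class='something']")
]

print(first_only)
print(per_paragraph)
```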

Answer

An alternative one-liner for your original code: join the results using an empty string as the separator:

print("".join(query_results))
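For completeness, here is a self-contained sketch of the join approach applied to the snippet from the question. The variable names chunks and joined are illustrative only.

```python
from lxml import html

html_snippet = (
    '<p class="something">This class has some <strong>text</strong> '
    'and a few <a href="http://www.example.com">links</a> in it.</p>'
)

tree = html.fromstring(html_snippet)

# //text() yields one string per text node, in document order
chunks = tree.xpath("//p[@class='something']//text()")

# The text nodes already carry their own spaces, so join with an empty separator
joined = "".join(chunks)

print(joined)
```

Because the text nodes keep the spaces that surround the inline tags, an empty separator reproduces the sentence as written; joining with a space would double them.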
