How to select text without the HTML markup

Problem description:

I am working on a web scraper (in Python), so I have a chunk of HTML that I am trying to extract text from. One of the snippets looks like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p> 

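I want to extract the text from that paragraph. Right now I can use something along the lines of

//p[@class='something']//text()

but this makes each chunk of text end up as a separate result element, like this: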

(This class has some ,text, and a few ,links, in it.) 

The desired output would contain all of the text in a single element, like this:

This class has some text and a few links in it. 

Is there a simple or elegant way to achieve this?

Edit: Here is the code that produces the result given above.

from lxml import html 

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>' 

xpath_query = "//p[@class='something']//text()" 

tree = html.fromstring(html_snippet) 
query_results = tree.xpath(xpath_query) 
for item in query_results: 
    print("'{0}'".format(item))

Which HTML parsing library are you using? – alecxe 2015-04-01 19:03:34

I am using lxml; I have updated the question. – Yuka 2015-04-01 19:10:38

You can call .text_content() on the lxml element instead of fetching the text nodes with XPath.

from lxml import html 

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>' 

xpath_query = "//p[@class='something']" 

tree = html.fromstring(html_snippet) 
query_results = tree.xpath(xpath_query) 
for item in query_results: 
    print("'{0}'".format(item.text_content()))
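
One thing to watch for: text_content() returns the text as it appears in the markup, so newlines and indentation inside the element are kept. Below is a minimal sketch of collapsing that whitespace in plain Python. The messy_snippet markup and the collapse_whitespace helper are made up for illustration and are not part of lxml.

```python
from lxml import html

# Hypothetical markup with line breaks and indentation inside the paragraph.
messy_snippet = """<p class="something">
    This class has some <strong>text</strong>
    and a few <a href="http://www.example.com">links</a> in it.
</p>"""


def collapse_whitespace(text):
    """Illustrative helper (not an lxml API): squeeze whitespace runs to one space."""
    return " ".join(text.split())


tree = html.fromstring(messy_snippet)
for item in tree.xpath("//p[@class='something']"):
    raw = item.text_content()           # still contains the newlines and indentation
    cleaned = collapse_whitespace(raw)  # single spaces, no leading/trailing whitespace
    print(repr(cleaned))
```

Calling split() with no arguments splits on any run of whitespace, which is why this also removes the leading and trailing newlines.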

You can use normalize-space() in XPath. Then

from lxml import html 

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>' 
xpath_query = "normalize-space(//p[@class='something'])" 

tree = html.fromstring(html_snippet) 
print(tree.xpath(xpath_query))

will produce

This class has some text and a few links in it. 
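
Note that in XPath 1.0, a string function such as normalize-space() applied to a node-set only uses the first node in document order. If the page has several matching paragraphs, one possible way around that is to evaluate normalize-space(.) relative to each element. A sketch, using a made-up two-paragraph snippet:

```python
from lxml import html

# Hypothetical snippet with two matching paragraphs, wrapped in a div.
html_snippet = (
    '<div>'
    '<p class="something">First   <strong>paragraph</strong> here.</p>'
    '<p class="something">Second <a href="http://www.example.com">one</a>.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Applied to the whole node-set, only the first paragraph is converted.
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Evaluated per element, every paragraph yields its own string.
texts = [p.xpath("normalize-space(.)") for p in tree.xpath("//p[@class='something']")]

print(first_only)
print(texts)
```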

An alternative one-liner for your original code: use join with an empty-string separator:

print("".join(query_results))
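
For reference, a self-contained version of that one-liner might look like the sketch below. The empty separator is the right choice here because the text nodes already carry their own leading and trailing spaces; joining with " " instead would double them.

```python
from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

tree = html.fromstring(html_snippet)

# One result per text node, as in the question.
query_results = tree.xpath("//p[@class='something']//text()")

# Each text node keeps its own surrounding spaces, so join with "".
joined = "".join(query_results)
print(joined)
```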