Parsing HTML with lxml and XPath

Question:

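I want to use lxml with Python because, after reading and Googling, the recommendation is to use lxml rather than other parsing packages. I have the following DOM structure, and I managed to write the correct XPath, which I then checked in XPath Checker to confirm that it is valid. The XPath works fine in XPath Checker, but when I use it with lxml in Python I do not get the result. In fact, I get an object instead of the actual text.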

Here is my DOM structure:

<div class="pdsc-l"> 
<table width="100%" cellspacing="0" cellpadding="0" border="0"> 
<tbody> 
<tr> 
<tr> 
<tr> 
<tr> 
<tr> 
<tr> 
<td width="35%" valign="top"> 
<font size="2" face="Arial, Helvetica, sans-serif">Brand</font> 
</td> 
<td width="65%" valign="top"> 
<font size="2" face="Arial, Helvetica, sans-serif">HTC</font> 
</td> 
</tr> 
<tr> 
<td width="35%" valign="top"> 
<td width="65%" valign="top"> 

The following XPath, which I wrote, gives me what I want:

//td//font[text()='Brand']/following::td[1] 

But with lxml I am not able to get the result.

This is my code: 
    rawPage = urllib2.urlopen(request) 
    read = rawPage.read() 
    #print read 
    tree = etree.HTML(read)  
    for tr in tree.xpath("//tr"): 
     print tr.xpath("//td//font[text()='Brand']/following::td[1]") 

Here is the output:

[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 
[<Element td at 0x10ad80b90>] 

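I tried the following change, but I still do not get the result. My code now includes the URL, which I hope will help in getting a better answer: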

    import urllib2
    from lxml import etree
    from lxml.html import fromstring, tostring

    url = 'http://www.ebay.com/ctg/111176858'
    request = urllib2.Request(url)
    rawPage = urllib2.urlopen(request)
    read = rawPage.read()
    #print read
    tree = etree.HTML(read)
    for tr in tree.xpath("//tr"):
        t = tr.xpath("//td//font[text()='Brand']/following::td[1]")[0]
        print tostring(t)

Maybe post the output you are receiving, so that we can learn more about what is happening?

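Append [0].text to the end of the expression in your print statement and it should give you what you want. Basically, what is printed in your question is a single-element list of lxml.etree._Element objects, which have attributes such as tag and text that you can use to get at their different properties. Note that text holds only the text directly inside the element, before its first child. So try: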

tr.xpath("//td//font[text()='Brand']/following::td[1]")[0].text 
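
To make this concrete, here is a minimal sketch in Python 3. The SAMPLE_HTML string is a made-up, trimmed-down stand-in for the table in the question (an assumption, not the real page markup). It shows that xpath() returns a list of elements, and that when the value is wrapped in a nested font tag, joining itertext() is one way to get the visible text, since .text on the td itself may be empty or only whitespace.

```python
from lxml import etree

# Hypothetical, trimmed-down stand-in for the page's spec table.
SAMPLE_HTML = """
<div class="pdsc-l">
<table>
<tr>
<td width="35%"><font size="2">Brand</font></td>
<td width="65%"><font size="2">HTC</font></td>
</tr>
<tr>
<td width="35%"><font size="2">Model</font></td>
<td width="65%"><font size="2">One</font></td>
</tr>
</table>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# xpath() returns a list of elements; run the query once on the whole tree.
cells = tree.xpath("//td//font[text()='Brand']/following::td[1]")

brand = None
if cells:  # guard against an empty list before indexing
    td = cells[0]
    # The value lives in the nested <font>, so join all descendant text.
    brand = "".join(td.itertext()).strip()

print(brand)
```

The query is run once rather than inside a loop over rows, because an expression that starts with // searches from the document root no matter which element it is called on. That would also explain why the question's loop prints the same element for every row.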

Thanks, I just added it.


I got your answer.


Index out of bounds in the output; I edited my answer accordingly.
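
On the index-out-of-bounds error: indexing with [0] fails whenever xpath() returns an empty list. A rough sketch of one way to guard against that, and to keep per-row queries relative to the row, might look like the following (Python 3). The sample markup and the first_text helper are hypothetical, introduced only for illustration.

```python
from lxml import etree

# Hypothetical sample rows; the real page's markup may differ.
SAMPLE_HTML = """
<table>
<tr><td><font>Brand</font></td><td><font>HTC</font></td></tr>
<tr><td><font>Model</font></td><td><font>One X</font></td></tr>
<tr><td colspan="2">no label/value pair here</td></tr>
</table>
"""

tree = etree.HTML(SAMPLE_HTML)


def first_text(nodes):
    """Return the stripped text of the first node, or None for an empty list."""
    if not nodes:
        return None
    return "".join(nodes[0].itertext()).strip()


specs = {}
for tr in tree.xpath("//tr"):
    # "./td" is relative to this row; "//td" would restart from the document root.
    cells = tr.xpath("./td")
    if len(cells) < 2:
        continue  # skip rows without a label/value pair instead of indexing blindly
    specs[first_text(cells[:1])] = first_text(cells[1:2])

print(specs.get("Brand"))

# Contrast absolute and relative paths when called on a single row.
last_row = tree.xpath("//tr")[-1]
print(len(last_row.xpath("//td")))  # absolute: every <td> in the document
print(len(last_row.xpath("./td")))  # relative: only this row's cells
```

Collecting the rows into a dictionary is just one design choice; it avoids repeating the label lookup for every field you want to read.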