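Python DOM parsing

Question:

I have an XHTML file formatted like the one below. I am trying to get all the text between the tags, in order. By calling this_list = get_e('td') and then passing that list to another function, as in get_text(this_list), I can retrieve everything except BAC. Is there a small modification to my functions that would pick up all of the text? Any suggestions would be appreciated.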

<tr> 
    <td colspan="1" rowspan="1" class="lft"> 
    <a shape="rect" href="http://www.usatoday.idmanagedsolutions.com/stocks/new/quote.idms?SYMBOL_US=BAC"> 
     BAC</a> 
    </td> 
    <td colspan="1" rowspan="1" class="lft"> 
    Bank Of America Corporation</td> 
    <td colspan="1" rowspan="1"> 
    9.79 
    </td> 
    <td colspan="1" rowspan="1"> 
    -0.07 
    </td> 
    <td colspan="1" rowspan="1"> 
    <span class="neg-arrw"> 
     -0.71% 
    </span> 
    </td> 
    <td colspan="1" rowspan="1"> 
    71,370,166 
    </td> 
</tr> 
<tr class="evenrow"> 
    <td colspan="1" rowspan="1" class="lft"> 
    VALE 
    </td> 
    <td colspan="1" rowspan="1" class="lft"> 
    Vale S A 
    </td> 
<td colspan="1" rowspan="1"> 
    17.52 
    </td> 
    <td colspan="1" rowspan="1"> 
    +0.09 
    </td> 
    <td colspan="1" rowspan="1"> 
    <span class="pos-arrw"> 
     +0.49% 
    </span> 
    </td> 
    <td colspan="1" rowspan="1"> 
    15,461,788</td> 
</tr> 

The functions I am using are below:

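def get_e(tag):
    l = []
    els = dom.getElementsByTagName(tag)
    for e in els:
        # collect only the direct children of each matching element
        for child_el in e.childNodes:
            l.append(child_el)
    return l

def get_text(els):
    l = []
    for e in els:
        # keep text nodes only; element nodes such as <a> are skipped
        if e.nodeType == e.TEXT_NODE:
            l.append(e.data)
    return l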

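Answer:

The get_text function only handles text nodes. Some of your td elements contain embedded element nodes such as a, so the text BAC is a child of the a element rather than of the td, and it gets filtered out. I have updated the code so that get_e calls itself recursively when it encounters an element node.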

from xml.dom import minidom

def get_e(dom, tag):
    l = []
    els = dom.getElementsByTagName(tag)
    for e in els:
        for child_el in e.childNodes:
            # if this is an element node, collect its children recursively
            # (getElementsByTagName matches every descendant with that tag name)
            if child_el.nodeType == child_el.ELEMENT_NODE:
                l.extend(get_e(e, child_el.tagName))
            else:
                l.append(child_el)
    return l

def get_text(els):
    l = []
    for e in els:
        if e.nodeType == e.TEXT_NODE:
            l.append(e.data)
    return l

dom = minidom.parse('s.xml')
print(get_text(get_e(dom, 'td')))
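
The same recursive idea can also be written as a plain tree walk that never goes back through getElementsByTagName. That avoids collecting the same text twice when a td holds several child elements with the same tag name. What follows is only a minimal sketch, not the code from the answer above: the helper names iter_text and cell_texts and the inline sample are assumptions made for illustration.

```python
from xml.dom import minidom

# Small inline sample (an assumption for illustration; the answer reads 's.xml').
SAMPLE = """<table>
<tr>
    <td class="lft"><a href="#"> BAC</a></td>
    <td class="lft"> Bank Of America Corporation</td>
    <td><span class="neg-arrw"> -0.71% </span></td>
</tr>
</table>"""


def iter_text(node):
    """Yield the data of every text node below node, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child.data
        elif child.nodeType == child.ELEMENT_NODE:
            # Descend into nested elements such as <a> or <span>.
            for piece in iter_text(child):
                yield piece


def cell_texts(dom, tag):
    """Return one whitespace-stripped string per element named tag."""
    return ["".join(iter_text(e)).strip() for e in dom.getElementsByTagName(tag)]


dom = minidom.parseString(SAMPLE)
print(cell_texts(dom, "td"))
```

Joining the pieces per td keeps each cell as a single value, which is probably easier to work with than a flat list of text nodes that still contains whitespace-only entries.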

Or perhaps you could consider something shorter:

import xml.etree.ElementTree as ET 
et = ET.parse('s.xml') 
print([e.findtext('.') for e in et.findall('.//*')])
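
Note that findtext('.') returns only the text found directly at the start of each element, and the resulting list has one entry for every element (tr, td, a, span), surrounding whitespace included. If one cleaned string per td is wanted instead, Element.itertext() can gather the nested text. The following is a minimal sketch under some assumptions: the rows are wrapped in a single root element, the document declares no XML namespace (with an XHTML namespace the tag name would need the '{namespace}td' form), and the name td_texts and the inline sample are illustrative only.

```python
import xml.etree.ElementTree as ET

# Inline sample (an assumption for illustration; the snippet above parses 's.xml').
SAMPLE = """<table>
<tr>
    <td class="lft">
    <a href="#">
     BAC</a>
    </td>
    <td>
    9.79
    </td>
    <td>
    <span class="neg-arrw">
     -0.71%
    </span>
    </td>
</tr>
</table>"""


def td_texts(root):
    """Return the stripped, concatenated text of every td under root."""
    # itertext() walks the element and all of its descendants in document order.
    return ["".join(td.itertext()).strip() for td in root.iter("td")]


root = ET.fromstring(SAMPLE)
print(td_texts(root))
```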