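Python DOM parsing
Problem description:
I have an XHTML file formatted like the one below. I am trying to get all the text between the tags, in order. I can call this_list = get_e('td') and then pass that list to another function to extract the text, as in get_text(this_list), and I receive everything except BAC. Is there a small modification to my functions that would get all of the text? Any suggestions would be appreciated.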
<tr>
<td colspan="1" rowspan="1" class="lft">
<a shape="rect" href="http://www.usatoday.idmanagedsolutions.com/stocks/new/quote.idms?SYMBOL_US=BAC">
BAC</a>
</td>
<td colspan="1" rowspan="1" class="lft">
Bank Of America Corporation</td>
<td colspan="1" rowspan="1">
9.79
</td>
<td colspan="1" rowspan="1">
-0.07
</td>
<td colspan="1" rowspan="1">
<span class="neg-arrw">
-0.71%
</span>
</td>
<td colspan="1" rowspan="1">
71,370,166
</td>
</tr>
<tr class="evenrow">
<td colspan="1" rowspan="1" class="lft">
VALE
</td>
<td colspan="1" rowspan="1" class="lft">
Vale S A
</td>
<td colspan="1" rowspan="1">
17.52
</td>
<td colspan="1" rowspan="1">
+0.09
</td>
<td colspan="1" rowspan="1">
<span class="pos-arrw">
+0.49%
</span>
</td>
<td colspan="1" rowspan="1">
15,461,788</td>
</tr>
The functions I am using are below:
def get_e(tag):
    # Collect the direct child nodes of every <tag> element
    l = []
    els = dom.getElementsByTagName(tag)
    for e in els:
        for child_el in e.childNodes:
            l.append(child_el)
    return l

def get_text(els):
    # Keep only the text nodes and return their data
    l = []
    for e in els:
        if e.nodeType == e.TEXT_NODE:
            l.append(e.data)
    return l
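Answer
Your get_text function expects only text nodes as input, but some of your td elements contain nested element nodes such as a. The text BAC belongs to the a element, not directly to the td, so it never reaches get_text. I have updated get_e so that when it meets an element node it calls itself recursively to collect that element's children.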
from xml.dom import minidom

def get_e(dom, tag):
    l = []
    els = dom.getElementsByTagName(tag)
    for e in els:
        for child_el in e.childNodes:
            # if this is an element node, get its children
            # (getElementsByTagName matches every descendant with that tag,
            # so a cell holding several same-named elements is visited more than once)
            if child_el.nodeType == e.ELEMENT_NODE:
                l.extend(get_e(e, child_el.tagName))
            else:
                l.append(child_el)
    return l

def get_text(els):
    l = []
    for e in els:
        if e.nodeType == e.TEXT_NODE:
            l.append(e.data)
    return l

dom = minidom.parse('s.xml')
print(get_text(get_e(dom, 'td')))
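As a variation on the same idea, here is a minimal sketch that walks each cell's child nodes directly instead of looking nested elements up again by tag name. The names iter_text and cell_texts and the inline SAMPLE string are made up for illustration, and the sample is a cut-down version of the table in the question, not the real file.

```python
from xml.dom import minidom

# Hypothetical, cut-down sample mirroring the first two cells in the question.
SAMPLE = """<table><tr>
<td class="lft"><a href="#">
BAC</a>
</td>
<td class="lft">
Bank Of America Corporation</td>
</tr></table>"""


def iter_text(node):
    """Yield the data of every text node under `node`, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child.data
        elif child.nodeType == child.ELEMENT_NODE:
            # Descend into nested elements such as <a> or <span>.
            yield from iter_text(child)


def cell_texts(dom, tag="td"):
    """Return one whitespace-stripped string per <tag> element."""
    return ["".join(iter_text(el)).strip()
            for el in dom.getElementsByTagName(tag)]


dom = minidom.parseString(SAMPLE)
result = cell_texts(dom)
print(result)
```

Because each cell's text is joined and stripped, the result holds one entry per td rather than a flat list of raw text nodes with their surrounding newlines.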
Or you might consider something shorter:
import xml.etree.ElementTree as ET
et = ET.parse('s.xml')
# findtext('.') returns each element's own leading text ('' if it has none)
print([e.findtext('.') for e in et.findall('.//*')])
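If one string per table cell is wanted, ElementTree's itertext() may be a better fit, since it yields the text of an element and of everything nested inside it. The sketch below assumes a cut-down inline sample (SAMPLE is made up for illustration) with no XML namespace. A real XHTML file usually declares one, in which case the tag would need to be written as '{http://www.w3.org/1999/xhtml}td'.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down sample: one cell with a nested <a>, one with a <span>.
SAMPLE = """<table><tr>
<td><a href="#">
BAC</a>
</td>
<td>
<span class="neg-arrw">
-0.71%
</span>
</td>
</tr></table>"""

root = ET.fromstring(SAMPLE)

# itertext() walks the element and all of its descendants in document order,
# so text inside <a> or <span> is included; strip() drops the line breaks.
cells = ["".join(td.itertext()).strip() for td in root.iter("td")]
print(cells)
```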