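Parsing hierarchical XML tags
Question:
I need to parse hierarchical tags from an XML file and print the label values in the output format shown below.
Input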
<doc>
<pid id="231">
<label key="">Electronics</label>
<desc/>
<cid id="122">
<label key="">TV</label>
</cid>
<desc/>
<cid id="123">
<label key="">Computers</label>
<cid id="12433">
<label key="">Lenovo</label>
</cid>
<desc/>
<cid id="12434">
<label key="">IBM</label>
<desc/>
</cid>
<cid id="12435">
<label key="">Mac</label>
</cid>
<desc/>
</cid>
</pid>
<pid id="7764">
<label key="">Music</label>
<desc/>
<cid id="1224">
<label key="">Play</label>
<desc/>
<cid id="341">
<label key="">PQR</label>
</cid>
<desc/>
</cid>
<cid id="221">
<label key="">iTunes</label>
<cid id="341">
<label key="">XYZ</label>
</cid>
<desc/>
<cid id="515">
<label key="">ABC</label>
</cid>
<desc/>
</cid>
</pid>
</doc>
Output
Electronics/
Electronics/TV
Electronics/Computers/Lenovo
Electronics/Computers/IBM
Electronics/Computers/Mac
Music/
Music/Play/PQR
Music/iTunes/XYZ
Music/iTunes/ABC
What I have tried (in Python)
import xml.etree.ElementTree as ET

def perf_func(elem, func, level=0):
    # Apply func to this element, then recurse into its children
    func(elem, level)
    for child in elem:
        perf_func(child, func, level + 1)

def print_level(elem, level):
    # Print the tag name, prefixed with one '-' per nesting level
    print('-' * level + elem.tag)

tree = ET.parse('Products.xml')
perf_func(tree.getroot(), print_level)

# Added find logic
root = tree.getroot()
# root is <doc> itself, so iterate over its <pid> children
for n in root.findall('pid'):
    l = n.find('label').text
    print(l)
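With the code above I can get each node and its level (that is, the tag names, not their values), and also all the first-level labels. I need some suggestions (Perl or Python) on how to proceed to get the hierarchical structure in the format shown in the output.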
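Answer
We will use three parts: find all the elements in the order they appear, get the depth of each element, and build the breadcrumbs from the depth and the order.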
from lxml import etree

# xml_str holds the XML document shown in the question
xml = etree.fromstring(xml_str)
elems = xml.xpath('//label')  # XPath expression to find all <label> elements

# counts the number of parents up to the root element
def get_depth(element):
    depth = 0
    parent = element.getparent()
    while parent is not None:
        depth += 1
        parent = parent.getparent()
    return depth

# build up the breadcrumbs by tracking the depth
# when a new element is entered, it replaces the value in the list
# at that level and drops all values to the right
def reduce_by_depth(element_list):
    crumbs = []
    elem_crumb = [''] * 10  # fixed size: handles depths 0 through 9
    for elem in element_list:
        depth = get_depth(elem)
        elem_crumb[depth] = elem.text
        elem_crumb[depth + 1:] = [''] * (10 - depth - 1)
        # join all the non-empty strings to get the breadcrumb
        crumbs.append('/'.join([e for e in elem_crumb if e]))
    return crumbs

reduce_by_depth(elems)
# output:
['Electronics',
'Electronics/TV',
'Electronics/Computers',
'Electronics/Computers/Lenovo',
'Electronics/Computers/IBM',
'Electronics/Computers/Mac',
'Music',
'Music/Play',
'Music/Play/PQR',
'Music/iTunes',
'Music/iTunes/XYZ',
'Music/iTunes/ABC']
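The list above includes intermediate nodes such as 'Electronics/Computers', while the format requested in the question lists only the top-level entries (with a trailing slash) and the leaves. As a rough sketch of one way to get that shape with only the standard library, the following walks the tree recursively. The names walk, breadcrumbs and SAMPLE are hypothetical, the sample is a shortened copy of the input, and the rule "top level plus leaves" is an assumption read off the expected output, not something the question states explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the input above.
SAMPLE = """
<doc>
  <pid id="231">
    <label key="">Electronics</label>
    <desc/>
    <cid id="122"><label key="">TV</label></cid>
    <cid id="123">
      <label key="">Computers</label>
      <cid id="12433"><label key="">Lenovo</label></cid>
      <cid id="12434"><label key="">IBM</label></cid>
    </cid>
  </pid>
</doc>
"""

def walk(node, prefix=()):
    """Yield (path, is_leaf) for every <pid>/<cid> element below node."""
    for child in node:
        if child.tag not in ("pid", "cid"):
            continue  # skip <label> and <desc>
        path = prefix + (child.findtext("label", default=""),)
        is_leaf = child.find("cid") is None  # no nested <cid> children
        yield path, is_leaf
        yield from walk(child, path)

def breadcrumbs(root):
    """Assumed rule: top-level entries get a trailing slash, deeper ones are kept only if they are leaves."""
    result = []
    for path, is_leaf in walk(root):
        if len(path) == 1:
            result.append(path[0] + "/")
        elif is_leaf:
            result.append("/".join(path))
    return result

root = ET.fromstring(SAMPLE)
crumbs = breadcrumbs(root)
for line in crumbs:
    print(line)
```

Because the path is passed down through the recursion, no fixed-size list is needed, so the nesting depth is limited only by Python's recursion limit.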
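Thanks a lot for the tree structure in the format mentioned... the breadcrumb logic is really nice :) – Debaditya
Take a look at etree's 'find' and 'findall' functions; they take an XPath expression. – FujiApple
Added the find logic (edited the question with what I tried)... I need some suggestions on how to get the output. – Debaditya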
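For reference, the find/findall suggestion from the comments might look roughly like this with the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath (lxml's xpath() method accepts full XPath 1.0). SAMPLE is a hypothetical, shortened copy of the input. Paths are evaluated relative to the element they are called on, which may be why searching for 'doc' from the root returns nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the input above.
SAMPLE = """<doc>
  <pid id="231">
    <label key="">Electronics</label>
    <cid id="122"><label key="">TV</label></cid>
  </pid>
  <pid id="7764">
    <label key="">Music</label>
  </pid>
</doc>"""

root = ET.fromstring(SAMPLE)  # root is the <doc> element

# 'pid/label' matches <label> elements that are direct children of a <pid> child.
top_level = [label.text for label in root.findall("pid/label")]

# './/label' matches <label> at any depth, in document order.
all_labels = [label.text for label in root.findall(".//label")]

# find() returns only the first match, or None when nothing matches.
first_pid = root.find("pid")
missing = root.find("doc")  # <doc> is the root itself, not one of its children

print(top_level)
print(all_labels)
```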