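A good Python XML parser for namespace-heavy documents

Problem description:

Python's ElementTree seems unusable with namespaces. What are my alternatives? BeautifulSoup is also pretty rubbish with namespaces. I don't want to strip them out.

Examples of how a particular Python library gets at namespaced elements and their collections all earn a +1.

Edit: Could you provide code that handles this real-world use case with your library of choice?

How would you go about getting the strings 'Line Break', '2.6' and the list ['PYTHON', 'XML', 'XML-NAMESPACES']?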

<?xml version="1.0" encoding="UTF-8"?> 
<zs:searchRetrieveResponse 
    xmlns="http://unilexicon.com/vocabularies/" 
    xmlns:zs="http://www.loc.gov/zing/srw/" 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:lom="http://ltsc.ieee.org/xsd/LOM"> 
    <zs:records> 
     <zs:record> 
      <zs:recordData> 
       <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema"> 
        <name>Line Break</name> 
        <dc:title>Processing XML namespaces using Python</dc:title> 
        <dc:description>How to get contents string from an element, 
         how to get a collection in a list...</dc:description> 
        <lom:metaMetadata> 
         <lom:identifier> 
          <lom:catalog>Python</lom:catalog> 
          <lom:entry>2.6</lom:entry> 
         </lom:identifier> 
        </lom:metaMetadata> 
        <lom:classification> 
         <lom:taxonPath> 
          <lom:taxon> 
           <lom:id>PYTHON</lom:id> 
          </lom:taxon> 
         </lom:taxonPath> 
        </lom:classification> 
        <lom:classification> 
         <lom:taxonPath> 
          <lom:taxon> 
           <lom:id>XML</lom:id> 
          </lom:taxon> 
         </lom:taxonPath> 
        </lom:classification> 
        <lom:classification> 
         <lom:taxonPath> 
          <lom:taxon> 
           <lom:id>XML-NAMESPACES</lom:id> 
          </lom:taxon> 
         </lom:taxonPath> 
        </lom:classification> 
       </srw_dc:dc> 
      </zs:recordData> 
     </zs:record> 
     <!-- ... more records ... --> 
    </zs:records> 
</zs:searchRetrieveResponse> 
+1

I love the meta nature of your MWE. – 2013-10-09 14:05:38

+0

Using relevant keywords in the sample code means more users can find the question and its answers. – 2013-10-10 10:21:02

lxml is namespace-aware.

>>> from lxml import etree 
>>> et = etree.XML("""<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>""") 
>>> etree.tostring(et, encoding=str) # encoding=str only needed in Python 3, to avoid getting bytes 
'<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz/></bar></root>' 
>>> et.xpath("f:bar", namespaces={"b":"bar", "f": "foo"}) 
[<Element {foo}bar at ...>] 
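
Besides XPath with a prefix map, elements can also be addressed by their fully qualified name in `{namespace-uri}localname` form (often called Clark notation). The following is a minimal sketch on the same toy document. It uses only calls that lxml shares with the standard library's ElementTree; the fallback import is an assumption added here so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib, which shares this API
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    '<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>'
)

# Tags carry the namespace URI, not the prefix used in the source text.
print(root.tag)                      # {foo}root

# Qualified names work directly in find()/findall() paths.
bar = root.find("{foo}bar")
baz = bar.find("{bar}baz")
print(bar.tag, baz.tag)              # {foo}bar {bar}baz

# The unqualified name does not match: <bar> lives in the "foo" namespace.
print(root.find("bar"))              # None
```

The prefix in the source (`stuff:`) is only shorthand; matching is always done on the URI, which is why the prefix map passed to `xpath()` may use different prefixes than the document does.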

Edit: applied to your example:

from lxml import etree 

# remove the b prefix in Python 2 
# needed in python 3 because 
# "Unicode strings with encoding declaration are not supported." 
et = etree.XML(b"""...""") 

ns = { 
    'lom': 'http://ltsc.ieee.org/xsd/LOM', 
    'zs': 'http://www.loc.gov/zing/srw/', 
    'dc': 'http://purl.org/dc/elements/1.1/', 
    'voc': 'http://unilexicon.com/vocabularies/',  # the document's default namespace 
    'srw_dc': 'info:srw/schema/1/dc-schema' 
} 

# according to the docs, .xpath always returns a list when querying for elements 
# .find returns a single element, but only supports a subset of XPath 
record = et.xpath("zs:records/zs:record", namespaces=ns)[0] 
# in this example, we know there's only one record; 
# otherwise, apply the following to every element returned above 

# ".//" searches below this record; a bare "//" would search the whole document 
name = record.xpath(".//voc:name", namespaces=ns)[0].text 
print("name:", name) 

lom_entry = record.xpath("zs:recordData/srw_dc:dc/" 
         "lom:metaMetadata/lom:identifier/" 
         "lom:entry", 
         namespaces=ns)[0].text 

print('lom_entry:', lom_entry) 

# "el" rather than "id", to avoid shadowing the built-in id() 
lom_ids = [el.text for el in 
      record.xpath("zs:recordData/srw_dc:dc/" 
         "lom:classification/lom:taxonPath/" 
         "lom:taxon/lom:id", 
         namespaces=ns)] 

print("lom_ids:", lom_ids) 

Output:

name: Line Break 
lom_entry: 2.6 
lom_ids: ['PYTHON', 'XML', 'XML-NAMESPACES'] 
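
For comparison, the same extraction can be sketched with the standard library's `xml.etree.ElementTree`, whose `find()`/`findall()` also accept a prefix map. This is not part of the lxml approach above, only an illustration under stated assumptions: the XML is a shortened copy of the document from the question, and the `voc` prefix is an arbitrary name chosen here for the document's default namespace.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's document (dc:title/dc:description omitted).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse
    xmlns="http://unilexicon.com/vocabularies/"
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
  <zs:records><zs:record><zs:recordData>
    <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
      <name>Line Break</name>
      <lom:metaMetadata><lom:identifier>
        <lom:catalog>Python</lom:catalog>
        <lom:entry>2.6</lom:entry>
      </lom:identifier></lom:metaMetadata>
      <lom:classification><lom:taxonPath><lom:taxon>
        <lom:id>PYTHON</lom:id>
      </lom:taxon></lom:taxonPath></lom:classification>
      <lom:classification><lom:taxonPath><lom:taxon>
        <lom:id>XML</lom:id>
      </lom:taxon></lom:taxonPath></lom:classification>
      <lom:classification><lom:taxonPath><lom:taxon>
        <lom:id>XML-NAMESPACES</lom:id>
      </lom:taxon></lom:taxonPath></lom:classification>
    </srw_dc:dc>
  </zs:recordData></zs:record></zs:records>
</zs:searchRetrieveResponse>"""

# Prefix map for queries; prefixes are local to this script.
NS = {
    "voc": "http://unilexicon.com/vocabularies/",  # default namespace of the document
    "zs": "http://www.loc.gov/zing/srw/",
    "lom": "http://ltsc.ieee.org/xsd/LOM",
    "srw_dc": "info:srw/schema/1/dc-schema",
}

root = ET.fromstring(XML)

for record in root.findall("zs:records/zs:record", NS):
    dc = record.find("zs:recordData/srw_dc:dc", NS)
    name = dc.find("voc:name", NS).text
    entry = dc.find("lom:metaMetadata/lom:identifier/lom:entry", NS).text
    ids = [el.text for el in dc.findall(
        "lom:classification/lom:taxonPath/lom:taxon/lom:id", NS)]
    print(name, entry, ids)
    # Line Break 2.6 ['PYTHON', 'XML', 'XML-NAMESPACES']
```

ElementTree's path language is only a subset of XPath, so lxml's `xpath()` remains the better fit for anything beyond simple child and descendant steps.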
+2

+1 lxml is the only Python tool/package you need for XML/XSLT/XPath-related tasks – snapshoe 2010-09-25 04:42:47

+0

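Edited: how would you code the provided example? The lack of recipes on the web for this kind of lxml work is shocking. For now I have worked around it by stripping the namespaces and traversing with BeautifulSoup, which is far from ideal on several levels. – 2010-09-25 10:46:16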

+0

@Frank Malina: XPath is not specific to lxml, and there are some resources on XPath available on the web. But I'll take a stab at it... – delnan 2010-09-25 11:00:35

+0

Do you have an example of how it works with namespaces? – 2010-09-24 09:29:33

libxml (http://xmlsoft.org/) is the best and fastest library for parsing XML. There are Python implementations.

+4

lxml from codespeak wraps and uses libxml – snapshoe 2010-09-25 04:41:32