Parsing XML entities with Python xml.sax

Problem description:

I am parsing XML with xml.sax, but my code fails to catch entities. Why is neither skippedEntity() nor resolveEntity() called in the following code?

import os 
import cStringIO 
import xml.sax 
from xml.sax.handler import ContentHandler,EntityResolver,DTDHandler 

#Class to parse and run test XML files 
class TestHandler(ContentHandler,EntityResolver,DTDHandler): 

    #SAX handler - Entity resolver 
    def resolveEntity(self,publicID,systemID): 
     print "TestHandler.resolveEntity: %s %s" % (publicID,systemID) 

    def skippedEntity(self, name): 
     print "TestHandler.skippedEntity: %s" % (name) 

    def unparsedEntityDecl(self,publicID,systemID,ndata): 
     print "TestHandler.unparsedEntityDecl: %s %s" % (publicID,systemID) 

    def startElement(self,name,attrs): 
     # name = string.lower(name) 
     summary = '' + attrs.get('summary','') 
     arg = '' + attrs.get('arg','') 
     print 'TestHandler.startElement(), %s : %s (%s)' % (name,summary,arg) 


def run(xml_string): 
    try: 
     parser = xml.sax.make_parser() 
     stream = cStringIO.StringIO(xml_string) 

     curHandler = TestHandler() 
     parser.setContentHandler(curHandler) 
     parser.setDTDHandler(curHandler) 
     parser.setEntityResolver(curHandler) 

     parser.parse(stream) 
     stream.close() 
    except (xml.sax.SAXParseException), e: 
     print "*** PARSER error: %s" % e; 

def main(): 
    try: 
     XML = "<!DOCTYPE page[ <!ENTITY num 'foo'> ]><test summary='step: &num;'>Entity: &not;</test>" 
     run(XML) 
    except Exception, e: 
     print 'FATAL ERROR: %s' % (str(e)) 

if __name__== '__main__': 
    main() 

When I run it, I see:

TestHandler.startElement(), test : step: foo ()
*** PARSER error: <unknown>:1:36: undefined entity 

Why don't I see resolveEntity reported for &num; or skippedEntity reported for &not;?

I think resolveEntity and skippedEntity are only called for external DTDs. I got this to work by modifying the XML:

XML = """<?xml version="1.0" encoding="utf-8" ?> 
<!DOCTYPE test SYSTEM "external.dtd" > 
<test summary='step: &foo; &bar;'>Entity: &not;</test> 
""" 

external.dtd contains two simple entity declarations:

<!ENTITY foo "bar"> 
<!ENTITY bar "foo"> 

I also removed resolveEntity.

This outputs:

TestHandler.startElement(), test : step: bar foo ()
TestHandler.skippedEntity: not 

Hope this helps.

Thanks, I didn't know the DTD had to be external. – Jonathan 2011-07-06 13:30:25

Here is a modified version of your program that I hope makes sense. It demonstrates a case in which all of the TestHandler methods are called.

import StringIO 
import xml.sax 
from xml.sax.handler import ContentHandler 

# Inheriting from EntityResolver and DTDHandler is not necessary 
class TestHandler(ContentHandler): 

    # This method is only called for external entities. Must return a value. 
    def resolveEntity(self, publicID, systemID): 
     print "TestHandler.resolveEntity(): %s %s" % (publicID, systemID) 
     return systemID 

    def skippedEntity(self, name): 
     print "TestHandler.skippedEntity(): %s" % (name) 

    def unparsedEntityDecl(self, name, publicID, systemID, ndata): 
     print "TestHandler.unparsedEntityDecl(): %s %s" % (publicID, systemID) 

    def startElement(self, name, attrs): 
     summary = attrs.get('summary', '') 
     print 'TestHandler.startElement():', summary 

def main(xml_string): 
    try: 
     parser = xml.sax.make_parser() 
     curHandler = TestHandler() 
     parser.setContentHandler(curHandler) 
     parser.setEntityResolver(curHandler) 
     parser.setDTDHandler(curHandler) 

     stream = StringIO.StringIO(xml_string) 
     parser.parse(stream) 
     stream.close() 
    except xml.sax.SAXParseException, e: 
     print "*** PARSER error: %s" % e 

XML = """<!DOCTYPE test SYSTEM "test.dtd"> 
<test summary='step: &num;'>Entity: &not;</test> 
""" 

main(XML) 

test.dtd contains:

<!ENTITY num "FOO"> 
<!ENTITY pic SYSTEM 'bar.gif' NDATA gif> 

Output:

TestHandler.resolveEntity(): None test.dtd 
TestHandler.unparsedEntityDecl(): None bar.gif 
TestHandler.startElement(): step: FOO 
TestHandler.skippedEntity(): not 
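
The program above is written for Python 2. A rough Python 3 sketch of the same experiment follows, under a few assumptions that are not in the original answer: the DTD is served from memory by resolveEntity (so no test.dtd file is needed), the callbacks are recorded in a list instead of printed, and feature_external_ges is switched on, because since Python 3.7.1 xml.sax ignores external entities unless that feature is enabled. The names RecordingHandler, TEST_DTD and parse_and_record are made up for this illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges
from xml.sax.xmlreader import InputSource

# Assumption: stand-in for the contents of test.dtd, kept in memory.
TEST_DTD = b"""<!ENTITY num "FOO">
<!ENTITY pic SYSTEM 'bar.gif' NDATA gif>
"""

XML = b"""<!DOCTYPE test SYSTEM "test.dtd">
<test summary='step: &num;'>Entity: &not;</test>
"""


class RecordingHandler(ContentHandler):
    """Hypothetical handler that records each callback as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def resolveEntity(self, publicId, systemId):
        self.events.append(("resolveEntity", publicId, systemId))
        # Serve the DTD from memory rather than from the file system.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(TEST_DTD))
        return source

    def skippedEntity(self, name):
        self.events.append(("skippedEntity", name))

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.events.append(
            ("unparsedEntityDecl", name, publicId, systemId, ndata))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name, attrs.get("summary", "")))


def parse_and_record(xml_bytes):
    handler = RecordingHandler()
    parser = xml.sax.make_parser()
    # Without this, Python 3.7.1+ never asks the resolver for the DTD.
    parser.setFeature(feature_external_ges, True)
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)
    parser.setDTDHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.events


if __name__ == "__main__":
    for event in parse_and_record(XML):
        print(event)
```

Run as a script, this should report the same four callbacks as the output above. Only enable feature_external_ges for input you trust, since it lets a document pull in external resources.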

Addition:

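As far as I can tell, skippedEntity is called only when an external DTD is used (at least I can't come up with a counterexample; it would be nice if the documentation were a little clearer).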
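
For comparison, here is a small Python 3 sketch of the case from the question, where the document has only an internal DTD subset. As far as I understand the XML 1.0 rules, a reference to an undeclared entity in such a document is a well-formedness error, so the parser raises SAXParseException instead of calling skippedEntity. SkipRecorder and try_parse are hypothetical names used only here.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

# Same document as in the question: only an internal DTD subset.
INTERNAL_ONLY = (
    b"<!DOCTYPE page[ <!ENTITY num 'foo'> ]>"
    b"<test summary='step: &num;'>Entity: &not;</test>"
)


class SkipRecorder(ContentHandler):
    """Hypothetical handler that records summaries and skipped entities."""

    def __init__(self):
        super().__init__()
        self.summaries = []
        self.skipped = []

    def startElement(self, name, attrs):
        self.summaries.append(attrs.get("summary", ""))

    def skippedEntity(self, name):
        self.skipped.append(name)


def try_parse(xml_bytes):
    """Return (handler, error); error is None if parsing succeeded."""
    handler = SkipRecorder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    try:
        parser.parse(io.BytesIO(xml_bytes))
    except xml.sax.SAXParseException as exc:
        return handler, exc
    return handler, None


if __name__ == "__main__":
    handler, error = try_parse(INTERNAL_ONLY)
    print("summaries:", handler.summaries)
    print("skipped:", handler.skipped)
    print("error:", error)
```

Here the internal entity &num; is expanded in the attribute value, while &not; stops the parse with an "undefined entity" error and the skipped list stays empty.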

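Adam says in his answer that resolveEntity is called only for external DTDs. That is not quite true. resolveEntity is also called when processing a reference to an external entity that is declared in the internal or external DTD subset. For example: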

<!DOCTYPE test [ 
<!ENTITY num SYSTEM "bar.txt"> 
]> 

where the content of bar.txt might be, say, FOO. In this case it is not possible to refer to the entity in an attribute value.
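
A minimal Python 3 sketch of this case, assuming the content of bar.txt is supplied from memory by resolveEntity and that feature_external_ges is enabled (needed since Python 3.7.1). Resolver, DOC, DOC_ATTR and parse are illustrative names, not part of xml.sax. The second document puts the reference inside an attribute value, which the parser should reject.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges
from xml.sax.xmlreader import InputSource

# External entity declared in the internal subset, referenced in content.
DOC = b"""<!DOCTYPE test [
<!ENTITY num SYSTEM "bar.txt">
]>
<test>Entity: &num;</test>
"""

# Same declaration, but referenced from an attribute value.
DOC_ATTR = b"""<!DOCTYPE test [
<!ENTITY num SYSTEM "bar.txt">
]>
<test summary='step: &num;'/>
"""


class Resolver(ContentHandler):
    """Hypothetical handler that records resolveEntity calls and text."""

    def __init__(self):
        super().__init__()
        self.resolved = []
        self.text = []

    def resolveEntity(self, publicId, systemId):
        self.resolved.append((publicId, systemId))
        source = InputSource(systemId)
        # Assumption: pretend that bar.txt contains the text FOO.
        source.setByteStream(io.BytesIO(b"FOO"))
        return source

    def characters(self, content):
        self.text.append(content)


def parse(xml_bytes):
    handler = Resolver()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, True)
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler


if __name__ == "__main__":
    handler = parse(DOC)
    print("resolved:", handler.resolved)
    print("text:", "".join(handler.text))
    try:
        parse(DOC_ATTR)
    except xml.sax.SAXParseException as exc:
        print("attribute reference rejected:", exc)
```

Note that there is no external DTD here, yet resolveEntity is still consulted for bar.txt.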

Thanks. Is there a way to get skippedEntity called without an external DTD? – Jonathan 2011-07-06 13:33:21

@Jonathan: I have updated my answer. – mzjn 2011-07-07 20:17:14