Parsing XML entities with Python xml.sax
Question:
I am parsing XML with Python using xml.sax, but my code fails to catch entities. Why don't skippedEntity() or resolveEntity() report anything in the following?
import os
import cStringIO
import xml.sax
from xml.sax.handler import ContentHandler,EntityResolver,DTDHandler

#Class to parse and run test XML files
class TestHandler(ContentHandler,EntityResolver,DTDHandler):

    #SAX handler - Entity resolver
    def resolveEntity(self,publicID,systemID):
        print "TestHandler.resolveEntity: %s %s" % (publicID,systemID)

    def skippedEntity(self, name):
        print "TestHandler.skippedEntity: %s" % (name)

    def unparsedEntityDecl(self,publicID,systemID,ndata):
        print "TestHandler.unparsedEntityDecl: %s %s" % (publicID,systemID)

    def startElement(self,name,attrs):
        # name = string.lower(name)
        summary = '' + attrs.get('summary','')
        arg = '' + attrs.get('arg','')
        print 'TestHandler.startElement(), %s : %s (%s)' % (name,summary,arg)

def run(xml_string):
    try:
        parser = xml.sax.make_parser()
        stream = cStringIO.StringIO(xml_string)
        curHandler = TestHandler()
        parser.setContentHandler(curHandler)
        parser.setDTDHandler(curHandler)
        parser.setEntityResolver(curHandler)
        parser.parse(stream)
        stream.close()
    except (xml.sax.SAXParseException), e:
        print "*** PARSER error: %s" % e

def main():
    try:
        XML = "<!DOCTYPE page[ <!ENTITY num 'foo'> ]><test summary='step: &num;'>Entity: &not;</test>"
        run(XML)
    except Exception, e:
        print 'FATAL ERROR: %s' % (str(e))

if __name__== '__main__':
    main()
When I run it, all I see is:
TestHandler.startElement(), test : step: foo ()
*** PARSER error: <unknown>:1:36: undefined entity
Why don't I see the resolveEntity print for &num; or the skippedEntity print for &not;?
Answer
I think resolveEntity and skippedEntity are only called for external DTDs. I got this working by modifying the XML:
XML = """<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE test SYSTEM "external.dtd" >
<test summary='step: &foo; &bar;'>Entity: &not;</test>
"""
external.dtd contains two simple entity declarations:
<!ENTITY foo "bar">
<!ENTITY bar "foo">
I also got rid of resolveEntity.
This outputs:
TestHandler.startElement(), test : step: bar foo ()
TestHandler.skippedEntity: not
Hope this helps.
Answer
Here is a modified version of your program that I hope makes sense. It demonstrates a case where all of the TestHandler methods are called.
import StringIO
import xml.sax
from xml.sax.handler import ContentHandler

# Inheriting from EntityResolver and DTDHandler is not necessary
class TestHandler(ContentHandler):

    # This method is only called for external entities. Must return a value.
    def resolveEntity(self, publicID, systemID):
        print "TestHandler.resolveEntity(): %s %s" % (publicID, systemID)
        return systemID

    def skippedEntity(self, name):
        print "TestHandler.skippedEntity(): %s" % (name)

    def unparsedEntityDecl(self, name, publicID, systemID, ndata):
        print "TestHandler.unparsedEntityDecl(): %s %s" % (publicID, systemID)

    def startElement(self, name, attrs):
        summary = attrs.get('summary', '')
        print 'TestHandler.startElement():', summary

def main(xml_string):
    try:
        parser = xml.sax.make_parser()
        curHandler = TestHandler()
        parser.setContentHandler(curHandler)
        parser.setEntityResolver(curHandler)
        parser.setDTDHandler(curHandler)
        stream = StringIO.StringIO(xml_string)
        parser.parse(stream)
        stream.close()
    except xml.sax.SAXParseException, e:
        print "*** PARSER error: %s" % e

XML = """<!DOCTYPE test SYSTEM "test.dtd">
<test summary='step: &num;'>Entity: &not;</test>
"""

main(XML)
test.dtd contains:
<!ENTITY num "FOO">
<!ENTITY pic SYSTEM 'bar.gif' NDATA gif>
Output:
TestHandler.resolveEntity(): None test.dtd
TestHandler.unparsedEntityDecl(): None bar.gif
TestHandler.startElement(): step: FOO
TestHandler.skippedEntity(): not
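For readers on Python 3, here is a minimal sketch of the same run that needs no files on disk. Several things here are assumptions that are not in the answer above: the DTD is served from an in-memory bytes object by resolveEntity, the handler records its calls in a list instead of printing them, and the names RecordingHandler, TEST_DTD and parse_events are made up for illustration. Since Python 3.7.1, xml.sax no longer processes external general entities by default, so the sketch switches feature_external_ges back on; that should only be done for trusted input.

```python
import io
import xml.sax
from xml.sax.handler import (ContentHandler, DTDHandler, EntityResolver,
                             feature_external_ges)
from xml.sax.xmlreader import InputSource

# Hypothetical in-memory stand-in for the test.dtd file from the answer.
TEST_DTD = b"""<!ENTITY num "FOO">
<!ENTITY pic SYSTEM 'bar.gif' NDATA gif>
"""

XML = b"""<!DOCTYPE test SYSTEM "test.dtd">
<test summary='step: &num;'>Entity: &not;</test>
"""


class RecordingHandler(ContentHandler, EntityResolver, DTDHandler):
    """Records handler calls in self.events instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def resolveEntity(self, publicId, systemId):
        self.events.append(("resolveEntity", publicId, systemId))
        # Serve the DTD from memory (for any systemId) so nothing is
        # read from disk or fetched over the network.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(TEST_DTD))
        return source

    def skippedEntity(self, name):
        self.events.append(("skippedEntity", name))

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.events.append(("unparsedEntityDecl", name, systemId, ndata))

    def startElement(self, name, attrs):
        self.events.append(("startElement", attrs.get("summary", "")))


def parse_events(xml_bytes):
    """Parse xml_bytes and return the list of recorded handler calls."""
    handler = RecordingHandler()
    parser = xml.sax.make_parser()
    # Off by default since Python 3.7.1; needed for the DTD to be loaded.
    parser.setFeature(feature_external_ges, True)
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)
    parser.setDTDHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.events


if __name__ == "__main__":
    for event in parse_events(XML):
        print(event)
```

Without the setFeature call, a recent Python 3 does not load the external DTD at all, so resolveEntity should not be called.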
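Addendum
As far as I can tell, skippedEntity is called only when an external DTD is used (at least I cannot come up with a counterexample; it would be nice if the documentation were a little clearer).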
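The internal-subset case from the question can be checked with a short Python 3 sketch. This is only an illustration under assumptions: the handler records calls in a list rather than printing, and the names InternalSubsetHandler, QUESTION_XML and parse_internal are hypothetical. It shows the internally declared entity being expanded without any entity callback, and the undeclared one being reported as a parse error instead of going to skippedEntity.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver

# The document from the question: everything is declared in the internal subset.
QUESTION_XML = (
    b"<!DOCTYPE page[ <!ENTITY num 'foo'> ]>"
    b"<test summary='step: &num;'>Entity: &not;</test>"
)


class InternalSubsetHandler(ContentHandler, EntityResolver):
    """Records which handler methods the parser calls."""

    def __init__(self):
        super().__init__()
        self.events = []

    def resolveEntity(self, publicId, systemId):
        self.events.append(("resolveEntity", systemId))
        return systemId

    def skippedEntity(self, name):
        self.events.append(("skippedEntity", name))

    def startElement(self, name, attrs):
        self.events.append(("startElement", attrs.get("summary", "")))


def parse_internal(xml_bytes):
    """Return (recorded events, parse error message or None)."""
    handler = InternalSubsetHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)
    error = None
    try:
        parser.parse(io.BytesIO(xml_bytes))
    except xml.sax.SAXParseException as exc:
        error = exc.getMessage()
    return handler.events, error


if __name__ == "__main__":
    events, error = parse_internal(QUESTION_XML)
    print(events)
    print(error)
```

Only the startElement call is recorded, with &num; already replaced by foo, and the reference to &not; surfaces as an "undefined entity" error, which matches the output shown in the question.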
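Adam says in his answer that resolveEntity is only called for external DTDs. That is not quite true: resolveEntity is also called when processing a reference to an external entity declared in either the internal or the external DTD subset. For example: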
<!DOCTYPE test [
<!ENTITY num SYSTEM "bar.txt">
]>
where bar.txt might contain, say, FOO. In this case it is not possible to refer to the entity in an attribute value.
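A rough Python 3 sketch of this case follows, with the entity referenced from element content. The assumptions are mine, not from the answer: bar.txt is simulated by a dict of in-memory bytes, the names ExternalEntityHandler, FILES and parse_with_external_entity are hypothetical, and feature_external_ges is enabled explicitly because Python 3.7.1 and later leave it off by default.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# An external entity declared in the internal DTD subset.
XML = b"""<!DOCTYPE test [
<!ENTITY num SYSTEM "bar.txt">
]>
<test>Entity: &num;</test>"""

# Hypothetical stand-in for the file system: system ID -> file content.
FILES = {"bar.txt": b"FOO"}


class ExternalEntityHandler(ContentHandler, EntityResolver):
    """Records resolved system IDs and collects character data."""

    def __init__(self, files):
        super().__init__()
        self.files = files
        self.resolved = []
        self.text = []

    def resolveEntity(self, publicId, systemId):
        self.resolved.append(systemId)
        # Hand the parser an in-memory stream instead of a real file.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(self.files[systemId]))
        return source

    def characters(self, content):
        self.text.append(content)


def parse_with_external_entity(xml_bytes, files):
    """Return (system IDs passed to resolveEntity, concatenated text)."""
    handler = ExternalEntityHandler(files)
    parser = xml.sax.make_parser()
    # Needed on Python 3.7.1+ for external entities to be processed.
    parser.setFeature(feature_external_ges, True)
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.resolved, "".join(handler.text)


if __name__ == "__main__":
    print(parse_with_external_entity(XML, FILES))
```

Here resolveEntity is called with bar.txt even though the document has no external DTD, and the replacement text arrives through characters().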
Thanks, I didn't know the DTD had to be external. – Jonathan 2011-07-06 13:30:25