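Parsing an XML file while it is being written (using Python)
I have to monitor an XML file that is written by a tool running all day long. But the XML file is only properly completed and closed at the end of the day.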
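The constraints are the same as for XML stream processing:
- Parse an incomplete XML file on the fly and trigger actions
- Keep track of the last position within the file, to avoid processing it again from the beginning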
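In an answer to "Need to read XML files as a stream using BeautifulSoup in Python", slezica suggests xml.sax, xml.etree.ElementTree and cElementTree. However, my attempts with xml.etree.ElementTree and cElementTree were unsuccessful. There are also xml.dom, xml.parsers.expat and lxml, but I do not see any support for "on-the-fly parsing" there.
I need more obvious examples...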
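I am currently using Python 2.7 on Linux, but I will migrate to Python 3.x => please also provide tips on new Python 3.x features. I also use watchdog to detect XML file modifications => optionally, reuse the watchdog mechanism. Optionally, support Windows as well.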
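Please provide a solution that is easy to understand and maintain. If it is too complex, I may just use tell()/seek() to move within the file, do a dumb text search in the raw XML, and finally extract the values with a basic regex.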
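For comparison, the tell()/seek() fallback mentioned above could look roughly like the following Python 3 sketch. The names FILENAME_RE and read_new_filenames are hypothetical, and it assumes the tool writes UTF-8 with one element per line, as in the sample. It only consumes data up to the last newline, so a half-written line is retried on the next call.

```python
import re

# Hypothetical pattern: matches one complete <filename>...</filename> element.
FILENAME_RE = re.compile(r"<filename>(.*?)</filename>")


def read_new_filenames(path, offset):
    """Return (filenames, new_offset) for the data appended after `offset`."""
    # Binary mode keeps offsets as plain byte counts on every platform.
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Stop at the last newline; anything after it may be a partial line.
    cut = data.rfind(b"\n") + 1
    text = data[:cut].decode("utf-8")
    return FILENAME_RE.findall(text), offset + cut
```

The caller keeps the returned offset and passes it back on the next file modification, for example from a watchdog on_modified callback. This stays simple but knows nothing about XML structure, which is why a real parser is preferable when it can be made to work.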
XML sample:
<dfxml xmloutputversion='1.0'>
  <creator version='1.0'>
    <program>TCPFLOW</program>
    <version>1.4.6</version>
  </creator>
  <configuration>
    <fileobject>
      <filename>file1</filename>
      <filesize>288</filesize>
      <tcpflow packets='12' srcport='1111' dstport='2222' family='2' />
    </fileobject>
    <fileobject>
      <filename>file2</filename>
      <filesize>352</filesize>
      <tcpflow packets='12' srcport='3333' dstport='4444' family='2' />
    </fileobject>
    <fileobject>
      <filename>file3</filename>
      <filesize>456</filesize>
      ...
...
My first test, using SAX, failed:
import xml.sax

class StreamHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print 'start: name=', name
    def endElement(self, name):
        print 'end: name=', name
        if name == 'root':
            raise StopIteration

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    parser.setContentHandler(StreamHandler())
    with open('f.xml') as f:
        parser.parse(f)
Shell:
$ while read line; do echo $line; sleep 1; done <i.xml >f.xml &
...
$ ./test-using-sax.py
start: name= dfxml
start: name= creator
start: name= program
end: name= program
start: name= version
end: name= version
Traceback (most recent call last):
  File "./test-using-sax.py", line 17, in <module>
    parser.parse(f)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse
    self.close()
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close
    self.feed("", isFinal = 1)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found
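The traceback shows that parse() calls close() once it reaches the current end of the file, and close() is what reports the missing closing tags. A minimal Python 3 sketch of the alternative, calling feed() repeatedly and never calling close(), is shown here. TagCollector and parse_chunks are hypothetical names, and the chunks simply stand in for successive reads of the growing file.

```python
import xml.sax
import xml.sax.handler


class TagCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records the name of each closed element."""

    def __init__(self):
        super().__init__()
        self.closed = []

    def endElement(self, name):
        self.closed.append(name)


def parse_chunks(chunks):
    """Feed successive text chunks to one incremental SAX parser."""
    parser = xml.sax.make_parser()
    handler = TagCollector()
    parser.setContentHandler(handler)
    for chunk in chunks:
        # feed() accepts partial documents; close() is deliberately not
        # called because the root element is still open.
        parser.feed(chunk)
    return handler.closed


closed = parse_chunks([
    "<dfxml xmloutputversion='1.0'>\n<creator version='1.0'>\n",
    "<program>TCPFLOW</program>\n<version>1.4.",
    "6</version>\n</creator>\n",
])
print(closed)
```

The second chunk stops in the middle of the version text on purpose: the parser keeps its state between feed() calls, so the caller does not have to split the input on element boundaries.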
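Three hours after posting my question, I had received no answer. But I have finally implemented the simple example I was looking for.
My inspiration comes from saaj's answer, and the example is based on xml.sax and watchdog.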
from __future__ import print_function, division
import time
import watchdog.events
import watchdog.observers
import xml.sax

class XmlStreamHandler(xml.sax.handler.ContentHandler):
    def startElement(self, tag, attributes):
        print(tag, 'attributes=', attributes.items())
        self.tag = tag
    def characters(self, content):
        print(self.tag, 'content=', content)

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.file = None
        self.parser = xml.sax.make_parser()
        self.parser.setContentHandler(XmlStreamHandler())
    def on_modified(self, event):
        if not self.file:
            self.file = open(event.src_path)
        self.parser.feed(self.file.read())

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()
While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with the following command:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
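Since yesterday, I have found Peter Gibson's answer about the undocumented xml.etree.ElementTree.XMLTreeBuilder._parser.EndElementHandler.
This example is similar to the other one, but uses xml.etree.ElementTree (and watchdog).
It does not work when ElementTree is replaced by cElementTree :-/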
import time
import watchdog.events
import watchdog.observers
import xml.etree.ElementTree

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.xml_file = None
        self.parser = xml.etree.ElementTree.XMLTreeBuilder()
        def end_tag_event(tag):
            node = self.parser._end(tag)
            print 'tag=', tag, 'node=', node
        self.parser._parser.EndElementHandler = end_tag_event
    def on_modified(self, event):
        if not self.xml_file:
            self.xml_file = open(event.src_path)
        buffer = self.xml_file.read()
        if buffer:
            self.parser.feed(buffer)

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()
While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with this one-line script:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
For information, xml.etree.ElementTree.iterparse does not seem to support a file that is still being written. My test code:
from __future__ import print_function, division
import xml.etree.ElementTree

if __name__ == '__main__':
    context = xml.etree.ElementTree.iterparse('f.xml', events=('end',))
    for action, elem in context:
        print(action, elem.tag)
My output:
end program
end version
end creator
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
Traceback (most recent call last):
  File "./iter.py", line 9, in <module>
    for action, elem in context:
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1281, in next
    self._root = self._parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 20, column 0
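As with SAX, the error comes from close() being called at the current end of the file. For the Python 3 migration, xml.etree.ElementTree.XMLPullParser (available since Python 3.4) looks like the documented counterpart of iterparse for data that arrives in pieces. The following is only a sketch: read_fileobjects is a hypothetical helper, and the two strings stand in for successive reads of the growing file.

```python
import xml.etree.ElementTree as ET


def read_fileobjects(parser, chunk):
    """Feed one chunk; return (filename, filesize) for each closed fileobject."""
    parser.feed(chunk)
    records = []
    for event, elem in parser.read_events():
        if elem.tag == "fileobject":
            records.append((elem.findtext("filename"), elem.findtext("filesize")))
            # Free the children that have already been handled.
            elem.clear()
    return records


# Only "end" events are requested, so elements are complete when reported.
parser = ET.XMLPullParser(events=("end",))
first = read_fileobjects(
    parser,
    "<dfxml xmloutputversion='1.0'>\n<fileobject>\n<filename>file1</filename>\n",
)
second = read_fileobjects(
    parser,
    "<filesize>288</filesize>\n</fileobject>\n<fileobject>\n",
)
print(first, second)
```

In a watchdog handler, the same helper could be called from on_modified with whatever the open file object returns from read(), in the same way the SAX example calls feed().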