Parsing an XML file while it is being written (using Python)

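Problem description:

I have to monitor an XML file that is written by a tool running all day long. However, the XML file is only properly completed and closed at the end of the day.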

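The constraints are the same as for XML stream processing:

  1. Parse the incomplete XML file on the fly and trigger actions
  2. Keep track of the last position in the file, to avoid processing it again from the beginning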

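In an answer to "Need to read XML files as a stream using BeautifulSoup in Python", slezica suggests xml.sax, xml.etree.ElementTree and cElementTree. However, my attempts with xml.etree.ElementTree and cElementTree were not successful. There are also xml.dom, xml.parsers.expat and lxml, but I do not see any support for "on-the-fly parsing" in them.

I need more obvious examples...

I am currently using Python 2.7 on Linux, but I will migrate to Python 3.x => please also provide tips on new Python 3.x features. I also use watchdog to detect modifications of the XML file => optionally, reuse the watchdog mechanism. Optionally, support Windows as well.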

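Please provide a solution that is easy to understand and maintain. If it gets too complex, I may just use tell()/seek() to move within the file, do a stupid text search in the raw XML, and finally extract the values with basic regular expressions.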
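For reference, the tell()/seek() plus regex fallback mentioned above could look roughly like the sketch below. It is only an illustration: the function name read_new_filenames and the pattern are made up, and it assumes that each element of interest sits on a single line (as in the sample further down) and that the file is UTF-8. Only complete lines are consumed, so a half-written line is read again on the next call.

```python
import re

# Hypothetical pattern: matches only <filename> elements that are already complete.
FILENAME_RE = re.compile(r"<filename>(.*?)</filename>")

def read_new_filenames(path, offset):
    """Return (filenames, new_offset) for the data appended after `offset`."""
    # Binary mode, so that offsets are plain byte positions.
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Stop at the last newline: anything after it may be a half-written line.
    end = data.rfind(b"\n") + 1
    text = data[:end].decode("utf-8")
    return FILENAME_RE.findall(text), offset + end
```

The caller keeps the returned offset and passes it back on the next modification event, which covers the "do not process from the beginning again" constraint without any XML parser.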


XML sample:

<dfxml xmloutputversion='1.0'> 
    <creator version='1.0'> 
    <program>TCPFLOW</program> 
    <version>1.4.6</version> 
    </creator> 
    <configuration> 
    <fileobject> 
     <filename>file1</filename> 
     <filesize>288</filesize> 
     <tcpflow packets='12' srcport='1111' dstport='2222' family='2' /> 
    </fileobject> 
    <fileobject> 
     <filename>file2</filename> 
     <filesize>352</filesize> 
     <tcpflow packets='12' srcport='3333' dstport='4444' family='2' /> 
    </fileobject> 
    <fileobject> 
     <filename>file3</filename> 
     <filesize>456</filesize> 
     ... 
     ... 

My first test, using SAX, fails:

import xml.sax

class StreamHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print 'start: name=', name
    def endElement(self, name):
        print 'end: name=', name
        if name == 'root':
            raise StopIteration

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    parser.setContentHandler(StreamHandler())
    with open('f.xml') as f:
        parser.parse(f)

Shell:

$ while read line; do echo $line; sleep 1; done <i.xml >f.xml & 
... 
$ ./test-using-sax.py 
start: name= dfxml 
start: name= creator 
start: name= program 
end: name= program 
start: name= version 
end: name= version 
Traceback (most recent call last): 
    File "./test-using-sax.py", line 17, in <module> 
    parser.parse(f) 
    File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse 
    xmlreader.IncrementalParser.parse(self, source) 
    File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse 
    self.close() 
    File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close 
    self.feed("", isFinal = 1) 
    File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed 
    self._err_handler.fatalError(exc) 
    File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError 
    raise exception 
xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found 
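The traceback hints at the cause: parse() reads up to the current end of the file and then calls close(), which tells expat that the document is finished, so the elements that are still open produce "no element found". A minimal sketch of the alternative, assuming the parser returned by xml.sax.make_parser() is the incremental expat reader seen in the traceback: call feed() with each new chunk and do not call close() while the file is still growing. The EventCollector class and the chunk strings are made up for illustration.

```python
import xml.sax
import xml.sax.handler

class EventCollector(xml.sax.handler.ContentHandler):
    """Illustrative handler: records the name of every element that ends."""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.ended = []

    def endElement(self, name):
        self.ended.append(name)

collector = EventCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(collector)

# Two chunks of an unfinished document, as they might be read from the file
# at different times. The split falls in the middle of a tag on purpose.
parser.feed("<dfxml><creator><program>TCPFLOW</prog")
ended_after_first_chunk = list(collector.ended)
parser.feed("ram><version>1.4.6</version></creator>")

# No parser.close() here: <dfxml> is still open, so close() would raise.
print(ended_after_first_chunk, collector.ended)
```

The parser keeps the incomplete `</prog` until the rest of the tag arrives, so the caller does not have to split the input on tag boundaries.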

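Three hours after posting my question, I had not received any answer. But I have finally implemented the simple example I was looking for.

My inspiration comes from saaj's answer, and the example is based on xml.sax and watchdog.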

from __future__ import print_function, division
import time
import watchdog.events
import watchdog.observers
import xml.sax

class XmlStreamHandler(xml.sax.handler.ContentHandler):
    def startElement(self, tag, attributes):
        print(tag, 'attributes=', attributes.items())
        self.tag = tag  # remembered so that characters() can name its element
    def characters(self, content):
        print(self.tag, 'content=', content)

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.file = None
        self.parser = xml.sax.make_parser()
        self.parser.setContentHandler(XmlStreamHandler())
    def on_modified(self, event):
        # Open the file only once; later reads continue from the last position.
        if not self.file:
            self.file = open(event.src_path)
        # feed() parses the new data without closing the document.
        self.parser.feed(self.file.read())

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()

While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with the following command:

while read line; do echo $line; sleep 1; done <in.xml >out.xml & 
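To actually trigger an action per record rather than print every event, a handler along the following lines might help. This is a sketch, not part of the tested script: FileObjectHandler and the on_record callback are hypothetical names, and the element names come from the XML sample in the question. Note that a SAX parser may deliver the text of one element through several characters() calls, so the pieces are joined in endElement().

```python
import xml.sax
import xml.sax.handler

class FileObjectHandler(xml.sax.handler.ContentHandler):
    """Builds one dict per <fileobject> and passes it to `on_record`."""
    def __init__(self, on_record):
        xml.sax.handler.ContentHandler.__init__(self)
        self.on_record = on_record  # hypothetical callback: the action to trigger
        self.record = None          # dict for the <fileobject> being read, if any
        self.text = []              # text pieces of the current element

    def startElement(self, name, attrs):
        self.text = []
        if name == 'fileobject':
            self.record = {}
        elif name == 'tcpflow' and self.record is not None:
            self.record.update(attrs.items())

    def characters(self, content):
        # May be called several times for a single text node.
        self.text.append(content)

    def endElement(self, name):
        if self.record is None:
            return
        if name in ('filename', 'filesize'):
            self.record[name] = ''.join(self.text).strip()
        elif name == 'fileobject':
            self.on_record(self.record)  # the record is complete
            self.record = None

records = []
parser = xml.sax.make_parser()
parser.setContentHandler(FileObjectHandler(records.append))
parser.feed("<dfxml><fileobject><filename>fi")
parser.feed("le1</filename><filesize>288</filesize>")
parser.feed("<tcpflow packets='12' srcport='1111' dstport='2222' family='2' />")
parser.feed("</fileobject><fileobject><filename>file2</filename>")
print(records)  # only file1 is reported; file2 is not complete yet
```

In the watchdog script, the same handler could be passed to setContentHandler() in place of XmlStreamHandler.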

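Since yesterday, I have found Peter Gibson's answer about the undocumented xml.etree.ElementTree.XMLTreeBuilder._parser.EndElementHandler.

This example is similar to the other one, but uses xml.etree.ElementTree (and watchdog).

It does not work when ElementTree is replaced by cElementTree :-/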

import time
import watchdog.events
import watchdog.observers
import xml.etree.ElementTree

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.xml_file = None
        self.parser = xml.etree.ElementTree.XMLTreeBuilder()
        # Relies on undocumented private attributes (_parser, _end)
        # of the Python 2.7 implementation.
        def end_tag_event(tag):
            node = self.parser._end(tag)
            print 'tag=', tag, 'node=', node
        self.parser._parser.EndElementHandler = end_tag_event

    def on_modified(self, event):
        # Open the file only once; later reads continue from the last position.
        if not self.xml_file:
            self.xml_file = open(event.src_path)
        buffer = self.xml_file.read()
        if buffer:
            self.parser.feed(buffer)

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()

While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with this one-line script:

while read line; do echo $line; sleep 1; done <in.xml >out.xml & 

For information, xml.etree.ElementTree.iterparse does not seem to support a file that is still being written. My test code:

from __future__ import print_function, division
import xml.etree.ElementTree

if __name__ == '__main__':
    context = xml.etree.ElementTree.iterparse('f.xml', events=('end',))
    for action, elem in context:
        print(action, elem.tag)

My output:

end program 
end version 
end creator 
end filename 
end filesize 
end tcpflow 
end fileobject 
end filename 
end filesize 
end tcpflow 
end fileobject 
end filename 
end filesize 
Traceback (most recent call last): 
    File "./iter.py", line 9, in <module> 
    for action, elem in context: 
    File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1281, in next 
    self._root = self._parser.close() 
    File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close 
    self._raiseerror(v) 
    File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror 
    raise err 
xml.etree.ElementTree.ParseError: no element found: line 20, column 0
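Regarding the Python 3.x part of the question: since Python 3.4, xml.etree.ElementTree provides XMLPullParser, which is fed with chunks and never needs the document to be complete. The sketch below is untested against the real tool; the helper feed_and_collect and the chunk strings are made up for illustration. As with the SAX variant, close() is not called while the file is still growing. With newer Expat releases the events for the most recent chunk can be delayed, and recent Python versions add a flush() method for that case, which is not used here.

```python
import xml.etree.ElementTree as ET

# Python 3.4+ only: non-blocking parser that reports 'end' events.
parser = ET.XMLPullParser(events=('end',))

def feed_and_collect(chunk):
    """Feed one chunk and return the tags of the elements completed by it."""
    parser.feed(chunk)
    # read_events() yields (event, element) pairs and consumes them.
    return [elem.tag for _event, elem in parser.read_events()]

first = feed_and_collect("<dfxml><creator><program>TCPFLOW</program>")
second = feed_and_collect("<version>1.4.6</version></creator><configuration>")
print(first, second)
# first == ['program'], second == ['version', 'creator']
```

In the watchdog handler, feed_and_collect() would be called from on_modified() with the result of self.xml_file.read(), replacing the private _parser.EndElementHandler hook.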