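Parsing data from XML and storing it in a database with Python

Question:

Hi everyone, I have a problem parsing an XML file and entering the data into SQLite. The format is shown below; I need to enter the token text such as 111, AAA, BBB, etc.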

<DOCUMENT> 
<PAGE width="544.252" height="634.961" number="1" id="p1"> 
<MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/> 

<BLOCK id="p1_b1"> 

<TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652"> 
<TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN> 
</TEXT> 
</BLOCK> 

<BLOCK id="p1_b3"> 

<TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096"> 
<TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes"  italic="yes">AAA</TOKEN> 
<TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN> 
<TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN> 
</TEXT> 
</BLOCK> 

<BLOCK id="p1_b4"> 

<TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026"> 
<TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN> 
<TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN> 
</TEXT> 

<TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026"> 
<TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN> 
</TEXT> 

<TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026"> 
<TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN> 
<TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN> 
</TEXT> 
</BLOCK> 
</PAGE> 
</DOCUMENT> 
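In .NET it was done with 3 foreach loops: 1. "DOCUMENT/PAGE/BLOCK", 2. "TEXT", 3. "TOKEN", and then the values were entered into the DB. I can't figure out how to do this in Python; I am trying it with the lxml module.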


Do you mean you need to get all the token values? Like ['111','BBB','EEE'] or [['111'],['BBB','EEE']] – virhilo 2011-01-09 10:40:23

Do you mean this?:

>>> xml = """<DOCUMENT> 
... <PAGE width="544.252" height="634.961" number="1" id="p1"> 
... <MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/> 
... 
... <BLOCK id="p1_b1"> 
... 
... <TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652"> 
... <TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN> 
... </TEXT> 
... </BLOCK> 
... 
... <BLOCK id="p1_b3"> 
... 
... <TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096"> 
... <TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes"  italic="yes">AAA</TOKEN> 
... <TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN> 
... <TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN> 
... </TEXT> 
... </BLOCK> 
... 
... <BLOCK id="p1_b4"> 
... 
... <TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026"> 
... <TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN> 
... <TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN> 
... </TEXT> 
... 
... <TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026"> 
... <TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN> 
... </TEXT> 
... 
... <TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026"> 
... <TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN> 
... <TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN> 
... </TEXT> 
... </BLOCK> 
... </PAGE> 
... </DOCUMENT>""" 
>>> from lxml import etree 
>>> parsed = etree.fromstring(xml) 
>>> tokens = parsed.xpath('//TOKEN/text()') 
>>> tokens 
['111', 'AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG', 'HHH'] 
>>> 

Or this?:

>>> parsed = etree.fromstring(xml) 
>>> for block in parsed.xpath('//PAGE/BLOCK/TEXT'): 
...     print(block.xpath('./TOKEN/text()'))
... 
['111'] 
['AAA', 'BBB', 'CCC'] 
['DDD', 'EEE'] 
['FFF'] 
['GGG', 'HHH'] 
>>> 
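
The snippets above only extract the token text; the question also asks about storing it in SQLite. A minimal sketch of that step follows. It mirrors the three nested loops from the .NET version and uses the standard-library sqlite3 module. The `tokens` table, its column names, and the helpers `token_rows` and `store_tokens` are assumptions made for illustration, not anything given in the question, and the XML is a shortened copy of the sample.

```python
import sqlite3

try:
    from lxml import etree  # the module used in this thread
except ImportError:
    # stdlib fallback; fromstring/findall/get/text behave the same here
    import xml.etree.ElementTree as etree

# Shortened copy of the sample document from the question.
XML = """<DOCUMENT>
<PAGE width="544.252" height="634.961" number="1" id="p1">
<BLOCK id="p1_b1">
<TEXT id="p1_t1" x="51.1" y="20.8652">
<TOKEN id="p1_w1" bold="yes" italic="no">111</TOKEN>
</TEXT>
</BLOCK>
<BLOCK id="p1_b3">
<TEXT id="p1_t6" x="24.099" y="572.096">
<TOKEN id="p1_w22" bold="yes" italic="yes">AAA</TOKEN>
<TOKEN id="p1_w23" bold="yes" italic="no">BBB</TOKEN>
</TEXT>
</BLOCK>
</PAGE>
</DOCUMENT>"""


def token_rows(xml_text):
    """Yield (page_id, block_id, text_id, token_id, value) for each TOKEN."""
    root = etree.fromstring(xml_text)
    for page in root.findall('PAGE'):
        for block in page.findall('BLOCK'):          # loop 1: DOCUMENT/PAGE/BLOCK
            for text in block.findall('TEXT'):       # loop 2: TEXT
                for token in text.findall('TOKEN'):  # loop 3: TOKEN
                    yield (page.get('id'), block.get('id'),
                           text.get('id'), token.get('id'), token.text)


def store_tokens(conn, rows):
    """Create the (hypothetical) tokens table and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "page_id TEXT, block_id TEXT, text_id TEXT, token_id TEXT, value TEXT)"
    )
    # Parameter placeholders keep the token text from being read as SQL.
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(':memory:')  # pass a file path instead to keep the data
store_tokens(conn, token_rows(XML))
stored = conn.execute(
    "SELECT block_id, value FROM tokens ORDER BY rowid").fetchall()
print(stored)
```

Keeping the ids of the enclosing PAGE, BLOCK and TEXT in each row means the grouping shown in the second snippet can be rebuilt later with a plain GROUP BY or ORDER BY; if only the flat list of values is needed, a single `value` column would do.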

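I tried it the same way, but I got an empty list. I had not added the "." in front of "/TOKEN/text()". Why did you add the dot, and what does it do? Anyway, thanks a lot, dude – Rakesh 2011-01-09 14:15:48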
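
On the dot: in XPath, an expression that starts with "/" is absolute, so it is evaluated from the document root even when it is called on a nested element, while "./" (or no prefix at all) starts from the element itself. A small sketch of the difference, using a cut-down version of the sample document; the variable names are only illustrative.

```python
from lxml import etree

# Cut-down version of the sample document, one TEXT with two TOKENs.
XML = """<DOCUMENT>
<PAGE id="p1">
<BLOCK id="p1_b3">
<TEXT id="p1_t6">
<TOKEN id="p1_w22">AAA</TOKEN>
<TOKEN id="p1_w23">BBB</TOKEN>
</TEXT>
</BLOCK>
</PAGE>
</DOCUMENT>"""

root = etree.fromstring(XML)
text = root.xpath('//PAGE/BLOCK/TEXT')[0]

# Absolute path: looks for a TOKEN element at the very top of the
# document. The top-level element is DOCUMENT, so nothing matches.
absolute = text.xpath('/TOKEN/text()')

# Relative paths: look for TOKEN children of the TEXT element itself.
relative = text.xpath('./TOKEN/text()')
bare = text.xpath('TOKEN/text()')

print(absolute)  # []
print(relative)  # ['AAA', 'BBB']
print(bare)      # ['AAA', 'BBB']
```

That would explain the empty list: without the dot, the expression ignores the TEXT element it was called on.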