XSLT - adding nodes by analysing a text string

Question:

I have an XML file that is generated by comparing two XML files. It looks like this:

<doc> 
    <para><change flag="start"/><content>changed text</content><change flag="end"/> para text</para> <!--considered as a change-->
    <para><change flag="start"/><content>changed <t/>text</content><change flag="end"/> para text</para><!--considered as a change-->
    <para><change flag="start"/><content>(1)</content><change flag="end"/> para text</para><!--not considered as a change-->
    <para><change flag="start"/><content>i.</content><change flag="end"/> para text</para><!--not considered as a change-->
    <para><change flag="start"/><content>•</content><change flag="end"/> para text</para><!--not considered as a change-->
    <para><change flag="start"/><content> </content><change flag="end"/> para text</para><!--not considered as a change-->
    <para><change flag="start"/><content>(1) this is a <t/> numberd list</content><change flag="end"/> para text</para><!--considered as a change-->
    <para><change flag="start"/><content>• this is a <t/> bullet list</content><change flag="end"/>para text</para><!--considered as a change-->
</doc>

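Here the <change> elements mark the differences between the two files: the changed content appears between the <change flag="start"/> and <change flag="end"/> elements.

My requirement is to convert this to HTML, and the content between <change flag="start"/> and <change flag="end"/> (the difference between the two XML files) should be wrapped in a <CH> element: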

<html> 
    <head></head> 
    <body> 
     <p><CH>changed text</CH>para text</p> 
     <p><CH>changed text</CH>para text</p> 
     <p><CH>(1)</CH>para text</p> 
     <p><CH>i.</CH>para text</p> 
     <p><CH>•</CH>para text</p> 
     <p><CH> </CH>para text</p> 
     <p><CH>(1) this is a numberd list</CH>para text</p> 
     <p><CH>• this is a bullet list</CH>para text</p> 
    </body> 
</html>

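The problem is that <change flag="start"/> and <change flag="end"/> are also added around bullets, list numbers, and some whitespace. Even though the XML comparison flags them, these should not be treated as changes in the HTML representation.

So the HTML output I actually expect is: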

<html> 
    <head></head> 
    <body> 
     <p><CH>changed text</CH> para text</p> 
     <p><CH>changed text</CH> para text</p> 
     <p>(1) para text</p> 
     <p>i. para text</p>
     <p>• para text</p> 
     <p> para text</p> 
     <p><CH>(1) this is a numberd list</CH> para text</p> 
     <p><CH>• this is a bullet list</CH> para text</p> 
    </body> 
</html>

I wrote the following XSLT for this task:

<xsl:template match="doc"> 
     <html> 
      <head></head> 
      <body> 
       <xsl:apply-templates/> 
      </body> 
     </html> 
    </xsl:template> 


    <xsl:template match="para"> 
     <p> 
      <xsl:apply-templates/> 
     </p> 
    </xsl:template> 


    <xsl:template match="*[preceding-sibling::change[@flag='start'] and following-sibling::change[@flag = 'end']] 
     [matches(.,$list.mapping/map/@numerator-regex)]"> 
     <CH> 
      <xsl:apply-templates/> 
     </CH> 
    </xsl:template> 


<xsl:variable name="list.mapping" as="element()*"> 
    <map numerator-regex="^\(\d\)"/> 
    <map numerator-regex="^\(\d\d\)"/> 
    <map numerator-regex="^\d\)"/> 
    <map numerator-regex="^\d\."/> 
    <map numerator-regex="^\([A-Za-z]\.\)"/> 
    <map numerator-regex="^•"/> 
    <map numerator-regex="^*"/> 
</xsl:variable> 


    <xsl:template match="content"> 
     <xsl:apply-templates/> 
    </xsl:template>

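But this is not working as expected. Can anyone suggest how I can do this, and in particular how to avoid adding the <CH> tag in the following cases:

bullets (•) [the bullet is enclosed between <change flag="start"/> and <change flag="end"/>]
list numbers such as (1), (a) [the list number is enclosed between <change flag="start"/> and <change flag="end"/>]
whitespace [the whitespace is enclosed between <change flag="start"/> and <change flag="end"/>]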

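How complex can this get: can there be several 'change' start/end elements inside a single 'para' element? What are the exact criteria for wrapping or not wrapping, and why is there one example where the number at the start is wrapped and another where it is not? Can you define a finite, well-defined list of regular expression patterns that match the input to be wrapped? –

@MartinHonnen, yes, there can be several change start/end elements in a single para. If the change is only the bullet (and not any text inside that bullet point), it should not be treated as a change. I have updated the question with the possible regular expressions. – sanjay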

Answer

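First, I think you need to change your list.mapping variable so that the patterns include the $ symbol. ^ matches the start of the text and $ matches the end. This stops ^\(\d\) from matching (1) this is a numberd list.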

<xsl:variable name="list.mapping" as="element()*"> 
    <map numerator-regex="^\(\d\)$"/> 
    <map numerator-regex="^\(\d\d\)$"/> 
    <map numerator-regex="^\d\)$"/> 
    <map numerator-regex="^\d\.$"/> 
    <map numerator-regex="^\([A-Za-z]\.\)$"/> 
    <map numerator-regex="^•$"/> 
    <map numerator-regex="^\*$"/> 
</xsl:variable>
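To make the effect of the end anchor concrete, here is a minimal sketch that assumes an XSLT 2.0 processor and can be run against any input document. It tests only literal strings taken from the question, and the expected results are noted in the comments.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>

  <!-- The input document is ignored; only the literal strings below are tested. -->
  <xsl:template match="/">
    <!-- No end anchor: matching the start of the text is enough, so this is true. -->
    <xsl:value-of select="matches('(1) this is a numberd list', '^\(\d\)')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- With $: the whole text must be just the list number, so this is false. -->
    <xsl:value-of select="matches('(1) this is a numberd list', '^\(\d\)$')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- A bare list number still matches the anchored pattern, so this is true. -->
    <xsl:value-of select="matches('(1)', '^\(\d\)$')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```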

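Also, because you declare the variable with as="element()*", $list.mapping holds the map elements themselves. For matching, that means you should write $list.mapping/@numerator-regex rather than $list.mapping/map/@numerator-regex. You should also be testing for content that does not match any of the patterns, since content that matches is exactly what must stay unwrapped.

The condition you want is this: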

[not($list.mapping/@numerator-regex[matches(current(), .)])]
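In that predicate, current() is the element being matched and the inner . is each numerator-regex attribute in turn. As a way to check the condition in isolation, the following sketch (assuming XSLT 2.0, and using only two of the patterns for brevity) prints for every content element whether it would be wrapped. It uses an equivalent some ... satisfies expression instead of current(), which is my rewording and not part of the original answer.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>

  <!-- Shortened pattern list, for illustration only. -->
  <xsl:variable name="list.mapping" as="element()*">
    <map numerator-regex="^\(\d\)$"/>
    <map numerator-regex="^•$"/>
  </xsl:variable>

  <!-- Report, for each content element, whether no pattern matches its text. -->
  <xsl:template match="/">
    <xsl:for-each select="//content">
      <xsl:value-of select="concat('[', ., '] wrap: ',
          not(some $r in $list.mapping/@numerator-regex satisfies matches(., $r)))"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Running it against the sample doc from the question is a quick way to see which entries a given pattern list would leave unwrapped before wiring the condition into a template.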

Try this XSLT:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="doc"> 
     <html> 
      <head></head> 
      <body> 
       <xsl:apply-templates/> 
      </body> 
     </html> 
    </xsl:template> 

    <xsl:template match="para"> 
     <p> 
      <xsl:apply-templates/> 
     </p> 
    </xsl:template> 

    <xsl:template match="*[preceding-sibling::change[@flag='start'] and following-sibling::change[@flag = 'end']] 
     [not($list.mapping/@numerator-regex[matches(current(), .)])]"> 
     <CH> 
      <xsl:apply-templates/> 
     </CH> 
    </xsl:template> 

<xsl:variable name="list.mapping" as="element()*"> 
    <map numerator-regex="^\(\d\)$"/> 
    <map numerator-regex="^\(\d\d\)$"/> 
    <map numerator-regex="^\d\)$"/> 
    <map numerator-regex="^\d\.$"/> 
    <map numerator-regex="^\([A-Za-z]\.\)$"/> 
    <map numerator-regex="^•$"/> 
    <map numerator-regex="^\*$"/> 
</xsl:variable> 
</xsl:stylesheet>

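This may not give you exactly the output you need, because the input XML may contain hidden Unicode characters that affect the matching, but it should give you a start.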
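One case from the question that the stylesheet above does not cover is whitespace-only content: a single space matches none of the patterns, so it would still be wrapped in <CH>. A possible extension, offered as a sketch and not as part of the original answer, is to add one more pattern, ^\s*$, to the list. This assumes an XSLT 2.0 processor. Note that \s in XPath regular expressions covers only space, tab, newline, and carriage return, so a non-breaking space would need its own pattern.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:variable name="list.mapping" as="element()*">
    <map numerator-regex="^\(\d\)$"/>
    <map numerator-regex="^\(\d\d\)$"/>
    <map numerator-regex="^\d\)$"/>
    <map numerator-regex="^\d\.$"/>
    <map numerator-regex="^\([A-Za-z]\.\)$"/>
    <map numerator-regex="^•$"/>
    <map numerator-regex="^\*$"/>
    <!-- Assumption added for this sketch: empty or whitespace-only content is not a change. -->
    <map numerator-regex="^\s*$"/>
  </xsl:variable>

  <xsl:template match="doc">
    <html>
      <head></head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="para">
    <p>
      <xsl:apply-templates/>
    </p>
  </xsl:template>

  <!-- Wrap the element only when none of the patterns matches its text. -->
  <xsl:template match="*[preceding-sibling::change[@flag='start'] and following-sibling::change[@flag='end']]
                        [not($list.mapping/@numerator-regex[matches(current(), .)])]">
    <CH>
      <xsl:apply-templates/>
    </CH>
  </xsl:template>
</xsl:stylesheet>
```

Elements that do match a pattern fall through to the built-in template rules, which copy their text to the output without the <CH> wrapper.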
