How to convert consecutive tags into nested tags or a table using XQuery
Problem description:
I have an XML file with consecutive tags instead of nested tags, as shown below:
<title>
<subtitle>
<topic att="TopicTitle">Topic title 1</topic>
<content att="TopicSubtitle">topic subtitle 1</content>
<content att="Paragraph">paragraph text 1</content>
<content att="Paragraph">paragraph text 2</content>
<content att="TopicSubtitle">topic subtitle 2</content>
<content att="Paragraph">paragraph text 1</content>
<content att="Paragraph">paragraph text 2</content>
<topic att="TopicTitle">Topic title 2</topic>
<content att="TopicSubtitle">topic subtitle 1</content>
<content att="Paragraph">paragraph text 1</content>
<content att="Paragraph">paragraph text 2</content>
<content att="TopicSubtitle">topic subtitle 2</content>
<content att="Paragraph">paragraph text 1</content>
<content att="Paragraph">paragraph text 2</content>
</subtitle>
</title>
I am using XQuery in BaseX, and I want to convert it into a table with the following columns:
Title       Subtitle    TopicTitle     TopicSubtitle     Paragraph
Irrelevant  Irrelevant  Topic title 1  Topic Subtitle 1  paragraph text 1
Irrelevant  Irrelevant  Topic title 1  Topic Subtitle 1  paragraph text 2
Irrelevant  Irrelevant  Topic title 1  Topic Subtitle 2  paragraph text 1
Irrelevant  Irrelevant  Topic title 1  Topic Subtitle 2  paragraph text 2
Irrelevant  Irrelevant  Topic title 2  Topic Subtitle 1  paragraph text 1
Irrelevant  Irrelevant  Topic title 2  Topic Subtitle 1  paragraph text 2
Irrelevant  Irrelevant  Topic title 2  Topic Subtitle 2  paragraph text 1
Irrelevant  Irrelevant  Topic title 2  Topic Subtitle 2  paragraph text 2
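I am new to XQuery and XPath, but I already understand the basics of navigating nodes and selecting the ones I need. What I don't know yet is how to handle consecutive data that I want to convert into nested XML or a table (CSV?). Can anyone help?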
Answer
You can use a tumbling window (https://www.w3.org/TR/xquery-30/#id-windows) to transform the flat XML into nested XML, for example:
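(: outer windows: each one starts at a topic element and runs up to the next topic :)
for tumbling window $w in title/subtitle/*
    start $t when $t instance of element(topic)
return
    <topic title="{$t/@att}">
    {
        (: inner windows: split the rest of the topic at each TopicSubtitle :)
        for tumbling window $content in tail($w)
            start $c when $c/@att = 'TopicSubtitle'
        return
            <subtopic title="{$c/@att}">
            {
                tail($content) ! <para>{node()}</para>
            }
            </subtopic>
    }
    </topic>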
This gives:
<topic title="TopicTitle">
<subtopic title="TopicSubtitle">
<para>paragraph text 1</para>
<para>paragraph text 2</para>
</subtopic>
<subtopic title="TopicSubtitle">
<para>paragraph text 1</para>
<para>paragraph text 2</para>
</subtopic>
</topic><topic title="TopicTitle">
<subtopic title="TopicSubtitle">
<para>paragraph text 1</para>
<para>paragraph text 2</para>
</subtopic>
<subtopic title="TopicSubtitle">
<para>paragraph text 1</para>
<para>paragraph text 2</para>
</subtopic>
</topic>
Based on that, I think you can then transform the whole thing into semicolon-separated data with:
string-join(
    <title>
        <subtitle>
        {
            for tumbling window $w in title/subtitle/*
                start $t when $t instance of element(topic)
            return
                <topic title="{$t/@att}" value="{$t}">
                {
                    for tumbling window $content in tail($w)
                        start $c when $c/@att = 'TopicSubtitle'
                    return
                        <subtopic title="{$c/@att}" value="{$c}">
                        {
                            tail($content) ! <para>{node()}</para>
                        }
                        </subtopic>
                }
                </topic>
        }
        </subtitle>
    </title>//para ! string-join(ancestor-or-self::* ! (text(), @value, 'Irrelevant')[1], ';'),
    (: row separator: shown as ' ' in the post, assumed here to be a newline so each para gets its own line :)
    '&#10;'
)
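Answer
Although positional grouping is the most general approach to this kind of problem (as Martin Honnen describes: tumbling windows in XQuery 3.0+, for-each-group/@group-starting-with in XSLT 2.0+), I don't think it is necessary here, because you are not actually trying to exploit the hierarchical structure that is implicit in the data.
Specifically, to convert one flat structure with an implicit hierarchy into another flat structure with an implicit hierarchy, you can do something along the following lines: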
<table>{
    (: attribute comparison is case-sensitive; the sample data uses 'Paragraph' :)
    for $para in title/subtitle/content[@att = 'Paragraph']
    return
        <row>
            <cell>irrelevant</cell>
            <cell>irrelevant</cell>
            <cell>{$para/preceding-sibling::topic[1]/string()}</cell>
            <cell>{$para/preceding-sibling::content[@att = 'TopicSubtitle'][1]/string()}</cell>
            <cell>{$para/string()}</cell>
        </row>
}</table>
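This is great. Exactly what I needed. Even after reading more about tumbling windows, I doubt I would have found this myself. It took a little time to adapt it to my file, but it is now working with several nested tumbling windows. Because it looks a bit dirty, I wanted to ask: do you know of a better way to do this? I mean, would Java, Python or another language be better suited to this kind of task? Thanks for your help! – ChuyTM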
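For someone who mainly writes XSLT (where you would use nested 'xsl:for-each-group group-starting-with' here), using XQuery for this already feels "dirty", but I think both languages are good choices for processing XML. If you are looking for a better structure for converting XML to CSV with XQuery, take a look at https://github.com/CliffordAnderson/XQuery4Humanists/blob/master/05-Generating-JSON-and-CSV.md. As for Python, I don't know it well, and even if I did, I think it would depend on which modules you can install.
As for plain Java with its built-in XML classes, I think it would need a lot of code. I don't know Java 8's streams/grouping well enough to estimate how much.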
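On the Python question raised in the comments, here is a minimal sketch of what the flat-to-table conversion might look like using only the standard library (xml.etree.ElementTree and csv), so no extra modules are assumed. The names rows_from_flat_xml, to_csv and SAMPLE are made up for illustration, and SAMPLE is a shortened copy of the document from the question. It makes one pass over the children of subtitle, remembers the most recent topic and subtitle, and emits one row per paragraph, which is roughly the same idea as the preceding-sibling lookups in the second answer.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question (illustrative only).
SAMPLE = """<title>
<subtitle>
<topic att="TopicTitle">Topic title 1</topic>
<content att="TopicSubtitle">topic subtitle 1</content>
<content att="Paragraph">paragraph text 1</content>
<content att="Paragraph">paragraph text 2</content>
<topic att="TopicTitle">Topic title 2</topic>
<content att="TopicSubtitle">topic subtitle 1</content>
<content att="Paragraph">paragraph text 1</content>
</subtitle>
</title>"""


def rows_from_flat_xml(xml_text):
    """Return one [Title, Subtitle, TopicTitle, TopicSubtitle, Paragraph] row per paragraph."""
    root = ET.fromstring(xml_text)  # root is the <title> element
    rows = []
    topic = subtopic = ""
    for child in root.find("subtitle"):
        kind = child.get("att")
        text = (child.text or "").strip()
        if kind == "TopicTitle":
            # A new topic starts; forget the subtitle of the previous topic.
            topic, subtopic = text, ""
        elif kind == "TopicSubtitle":
            subtopic = text
        elif kind == "Paragraph":
            rows.append(["Irrelevant", "Irrelevant", topic, subtopic, text])
    return rows


def to_csv(rows):
    """Serialize the rows as semicolon-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", lineterminator="\n")
    writer.writerow(["Title", "Subtitle", "TopicTitle", "TopicSubtitle", "Paragraph"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(rows_from_flat_xml(SAMPLE)), end="")
```

Whether this is "cleaner" than the nested tumbling windows is a matter of taste. It trades the declarative grouping for explicit running state, and it only covers the three att values that appear in the question.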