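How do I modify the top-level XML node in R?
I want to add an attribute to the topmost node of an XML file and then save the file. I have tried every combination of XPath and subsetting I can think of, but I can't get it to work. A simple example: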
library(XML)

xml_string = c(
'<?xml version="1.0" encoding="UTF-8"?>',
'<retrieval-response status = "found">',
'<coredata>',
'<id type = "author" >12345</id>',
'</coredata>',
'<author>',
'<first>John</first>',
'<last>Doe</last>',
'</author>',
'</retrieval-response>')
# parse xml content
xml = xmlParse(xml_string)
When I try
xmlAttrs(xml["/retrieval-response"][[1]]) <- c(id = 12345)
I get an error:
object of type 'externalptr' is not subsettable
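However, the attribute is inserted anyway, so I don't know what I'm doing wrong. (More background: this is a simplified version of data from the Scopus API. I have thousands of similarly structured XML files, but the id sits in the "coredata" node, which is a sibling of the "author" node that holds all the data. So when I use SAS to compile the combined XML document into datasets, there is no link between the id and the data. I'm hoping that adding the id to the top of the hierarchy will make it propagate to all the other levels.)
Edit: After trying the approach of editing the top node (see Old Answer), I realized that editing the top-level node does not solve my problem, because the SAS XML Mapper did not preserve all the IDs.
I tried a new approach of adding the author id to each child node, which worked perfectly. I also learned that you can select multiple nodes with XPath by putting the expressions into a vector, like this: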
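c("//coredata",
"//affiliation-current",
"//affiliation-history",  # "//" added so the node is matched at any depth
"//subject-areas",        # "//" added so the node is matched at any depth
"//author-profile")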
So the final solution I used was:
library(XML)

files <- list.files()
for (i in seq_along(files)) {
  # parse one author record
  author_record <- xmlParse(files[i])
  # add the author id as an attribute on every selected node
  xpathApply(
    author_record, c(
      "//coredata",
      "//affiliation-current",
      "//affiliation-history",
      "//subject-areas",
      "//author-profile"
    ),
    addAttributes,
    auth_id = gsub("AUTHOR_ID:", "", xmlValue(author_record[["//dc:identifier"]]))
  )
  # overwrite the original file with the modified document
  saveXML(author_record, file = files[i])
}
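To see the multi-node approach in isolation, here is a minimal sketch on a small in-memory document. The document is made up for illustration (it has no namespaces, so the id lives in a plain `identifier` element rather than `dc:identifier`), and the XPath expressions are joined with `|` into a single union expression instead of being passed as a vector.

```r
library(XML)

# Made-up stand-in for one author record (assumption: no namespaces)
doc <- xmlParse(paste0(
  "<author-retrieval-response>",
  "<coredata><identifier>AUTHOR_ID:12345</identifier></coredata>",
  "<subject-areas><subject-area>Software</subject-area></subject-areas>",
  "<author-profile><affiliation-current/></author-profile>",
  "</author-retrieval-response>"
), asText = TRUE)

# Strip the "AUTHOR_ID:" prefix from the identifier text
auth_id <- sub("AUTHOR_ID:", "",
               xpathSApply(doc, "//coredata/identifier", xmlValue)[[1]],
               fixed = TRUE)

# One XPath union expression selecting all nodes that should carry the id
targets <- "//coredata | //subject-areas | //author-profile | //affiliation-current"
invisible(xpathApply(doc, targets, addAttributes, auth_id = auth_id))

# Names of the nodes that now have an auth_id attribute
tagged <- xpathSApply(doc, "//*[@auth_id]", xmlName)
print(tagged)
```

Because the nodes are modified in place, `saveXML(doc, file = ...)` afterwards would write the attributes out with the rest of the document.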
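Old answer: After much experimentation, I found a fairly simple solution to my problem.
Attributes can be added to the top-level node simply by using
addAttributes(xmlRoot(xmlfile), attribute = "attributeValue")
For my specific case, the simplest solution is a simple loop: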
library(XML)

setwd("C:/directory/with/individual/xmlfiles")
files <- list.files()
for (i in seq_along(files)) {
  author_record <- xmlParse(files[i])
  # copy the author id (without its "AUTHOR_ID:" prefix) onto the root node
  addAttributes(node = xmlRoot(author_record),
                id = gsub(pattern = "AUTHOR_ID:",
                          replacement = "",
                          x = xmlValue(author_record[["//dc:identifier"]])
                )
  )
  saveXML(author_record, file = files[i])
}
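As a quick check of the `addAttributes(xmlRoot(...))` idea, here is a minimal sketch that applies it to the sample XML from the question. The variable names `doc` and `root` are only for illustration.

```r
library(XML)

xml_string <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<retrieval-response status = "found">',
  '<coredata>',
  '<id type = "author" >12345</id>',
  '</coredata>',
  '<author>',
  '<first>John</first>',
  '<last>Doe</last>',
  '</author>',
  '</retrieval-response>')

# Parse the sample document from a single string
doc <- xmlParse(paste(xml_string, collapse = "\n"), asText = TRUE)

# Grab the top-level node and add the attribute to it in place
root <- xmlRoot(doc)
addAttributes(root, id = "12345")

# Inspect the attributes of the root node and the serialized document
print(xmlAttrs(root))
cat(saveXML(doc))
```

`addAttributes()` changes the internal node in place, so nothing has to be assigned back into the document before calling `saveXML()`.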
I'm sure there is a better way. Clearly I need to learn XSLT, it is a very powerful approach!
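To migrate XML data into the two-dimensional rows and columns of datasets and data frames, all nesting must be removed so that only a parent and one level of children are iterated. XSLT is a special-purpose declarative language designed for exactly this kind of restructuring, and it can reshape XML data to fit whatever the end use requires.
Given your example XML, below is a working XSLT; the resulting XML imports successfully into SAS. Have the SAS code loop over the files to restructure all of the thousands of XML files.
XSLT (save as an .xsl or .xslt file)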
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:ait="http://www.elsevier.com/xml/ani/ait"
xmlns:ce="http://www.elsevier.com/xml/ani/common"
xmlns:cto="http://www.elsevier.com/xml/cto/dtd"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:ns1="http://webservices.elsevier.com/schemas/search/fast/types/v4"
xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
xmlns:xoe="http://www.elsevier.com/xml/xoe/dtd"
exclude-result-prefixes="ait ce cto dc ns1 prism xocs xoe">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:template match="author-retrieval-response">
<xsl:variable select="substring-after(coredata/dc:identifier, ':')" name="authorid"/>
<root>
<coredata>
<authorid><xsl:value-of select="$authorid"/></authorid>
<xsl:for-each select="coredata/*">
<xsl:element name="{local-name()}">
<xsl:value-of select="concat(.,@href)"/>
</xsl:element>
</xsl:for-each>
</coredata>
<subjectAreas>
<authorid><xsl:value-of select="$authorid"/></authorid>
<xsl:for-each select="subject-areas/*">
<xsl:element name="{local-name()}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:for-each>
</subjectAreas>
<authorname>
<authorid><xsl:value-of select="$authorid"/></authorid>
<xsl:for-each select="author-profile/preferred-name/*">
<xsl:element name="{local-name()}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:for-each>
</authorname>
<classifications>
<authorid><xsl:value-of select="$authorid"/></authorid>
<xsl:for-each select="author-profile/classificationgroup/classifications/*">
<xsl:element name="{local-name()}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:for-each>
</classifications>
<journals>
<authorid><xsl:value-of select="$authorid"/></authorid>
<xsl:for-each select="author-profile/journal-history/journal/*">
<xsl:element name="{local-name()}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:for-each>
</journals>
<ipdoc>
<authorid><xsl:value-of select="$authorid"/></authorid>
<xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/*[not(local-name()='address')]">
<xsl:element name="{local-name()}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:for-each>
</ipdoc>
<address>
<authorid><xsl:value-of select="$authorid"/></authorid>
<xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/address/*">
<xsl:element name="{local-name()}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:for-each>
</address>
</root>
</xsl:template>
</xsl:transform>
SAS (using the above script)
proc xsl
in="C:\Path\To\Original.xml"
out="C:\Path\To\Output.xml"
xsl="C:\Path\To\XSLT.xsl";
run;
** STORING XML CONTENT;
libname temp xml 'C:\Path\To\Output.xml';
** APPEND CONTENT TO SAS DATASETS;
data Work.Coredata;
retain authorid;
set temp.Coredata; ** NAME OF PARENT NODE IN XML;
run;
data Work.SubjectAreas;
retain authorid;
set temp.SubjectAreas; ** NAME OF PARENT NODE IN XML;
run;
data Work.Authorname;
retain authorid;
set temp.Authorname; ** NAME OF PARENT NODE IN XML;
run;
data Work.Classifications;
retain authorid;
set temp.Classifications; ** NAME OF PARENT NODE IN XML;
run;
data Work.Journals;
retain authorid;
set temp.Journals; ** NAME OF PARENT NODE IN XML;
run;
data Work.Ipdoc;
retain authorid;
set temp.Ipdoc; ** NAME OF PARENT NODE IN XML;
run;
XML OUTPUT (which was imported as the Authorsdata dataset with one row and 40 variables)
<?xml version="1.0" encoding="UTF-8"?>
<root>
<coredata>
<authorid>1234567</authorid>
<url>http://api.elsevier.com/content/author/author_id/1234567</url>
<identifier>AUTHOR_ID:1234567</identifier>
<eid>9-s2.0-1234567</eid>
<document-count>3</document-count>
<cited-by-count>95</cited-by-count>
<citation-count>97</citation-count>
<link>http://api.elsevier.com/content/search/scopus?query=refauid%1234567%29</link>
<link>http://www.scopus.com/authid/detail.url?partnerID=HzOxMe3b&amp;authorId=1234567&amp;origin=inward</link>
<link>http://api.elsevier.com/content/author/author_id/1234567</link>
<link>http://api.elsevier.com/content/search/scopus?query=au-id%281234567%29</link>
</coredata>
<subjectAreas>
<authorid>1234567</authorid>
<subject-area>Human-Computer Interaction</subject-area>
<subject-area>Control and Systems Engineering</subject-area>
<subject-area>Software</subject-area>
<subject-area>Computer Vision and Pattern Recognition</subject-area>
<subject-area>Artificial Intelligence</subject-area>
</subjectAreas>
<authorname>
<authorid>1234567</authorid>
<initials>A.</initials>
<indexed-name>John A.</indexed-name>
<surname>John</surname>
<given-name>Doe</given-name>
</authorname>
<classifications>
<authorid>1234567</authorid>
<classification>1709</classification>
<classification>2207</classification>
<classification>1712</classification>
<classification>1707</classification>
<classification>1702</classification>
</classifications>
<journals>
<authorid>1234567</authorid>
<sourcetitle>Very Prestigious Journal</sourcetitle>
<sourcetitle-abbrev>V PRES JOU Autom</sourcetitle-abbrev>
<issn>10504729</issn>
<sourcetitle>2005 Another Prestigious Journal</sourcetitle>
<sourcetitle-abbrev>An. Prest. Jou. </sourcetitle-abbrev>
</journals>
<ipdoc>
<authorid>1234567</authorid>
<afnameid>Prestigious University#1111111</afnameid>
<afdispname>Prestigious University University</afdispname>
<preferred-name>Prestigious University University</preferred-name>
<sort-name>Prestigious University</sort-name>
<org-domain>pu.edu</org-domain>
<org-URL>http://www.pu.edu/index.shtml</org-URL>
</ipdoc>
<address>
<authorid>1234567</authorid>
<address-part>1234 Prestigious Lane</address-part>
<city>City</city>
<state>ST</state>
<postal-code>12345</postal-code>
<country>United States</country>
</address>
</root>
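R alternative
Since no comprehensive XSLT library exists for R, the parsing has to be done directly in the R language. However, R can call the XSLT processors of other executables (i.e., Python, Saxon, VBA) via the command line, the RDCOMClient package, and other interfaces.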
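As one possible illustration of the command-line route, the sketch below shells out to `xsltproc`. This is an assumption on my part: `xsltproc` is not mentioned above and has to be installed separately, and `run_xslt()` is a hypothetical helper name. Any other processor with a command-line interface could be swapped in with the appropriate arguments.

```r
# Hypothetical helper: apply an XSLT stylesheet by calling an external
# command-line processor (assumed here to be xsltproc).
run_xslt <- function(xsl, input, output, exe = "xsltproc") {
  exe_path <- Sys.which(exe)
  if (!nzchar(exe_path)) {
    stop("XSLT processor not found on PATH: ", exe)
  }
  # xsltproc usage: xsltproc -o <output> <stylesheet> <input>
  status <- system2(exe_path,
                    args = c("-o", shQuote(output), shQuote(xsl), shQuote(input)))
  if (status != 0) {
    stop("XSLT transformation failed with exit status ", status)
  }
  invisible(output)
}

# Usage with the placeholder paths from the SAS example (not run):
# run_xslt("C:/Path/To/XSLT.xsl", "C:/Path/To/Original.xml", "C:/Path/To/Output.xml")
```

The transformed file could then be read back into R with `xmlParse()` or handed to SAS as before.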
Nonetheless, R can extract the XML data with xmlToDataFrame() and, for the authorid, xpathSApply() (the latter is XPath-based):
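library(XML)

# Assumption: parse the same placeholder input file used in the SAS step
doc <- xmlParse("C:/Path/To/Original.xml")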
coredata <- xmlToDataFrame(nodes = getNodeSet(doc, '//coredata'))
coredata$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
subjectareas <- xmlToDataFrame(nodes = getNodeSet(doc, "//subject-areas"))
subjectareas$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
authorname <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/preferred-name'))
authorname$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
classifications <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/classificationgroup/classifications'))
classifications$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
journal <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/journal-history/journal'))
journal$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
ipdoc <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc'))
ipdoc$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
address <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc/address'))
address$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
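The repeated pattern above could be wrapped in a small helper. The following is only a sketch: `with_author_id()` is a hypothetical name, and the in-memory document is a simplified stand-in without namespaces, so the id is read from `//coredata/identifier` rather than `//coredata/dc:identifier`.

```r
library(XML)

# Simplified stand-in for one author record (assumption: no namespaces)
doc <- xmlParse(paste0(
  "<author-retrieval-response>",
  "<coredata><identifier>AUTHOR_ID:12345</identifier><eid>9-s2.0-12345</eid></coredata>",
  "<author-profile><preferred-name>",
  "<surname>Doe</surname><given-name>John</given-name>",
  "</preferred-name></author-profile>",
  "</author-retrieval-response>"
), asText = TRUE)

# Extract the author id once instead of once per data frame
authorid <- sub("AUTHOR_ID:", "",
                xpathSApply(doc, "//coredata/identifier", xmlValue)[[1]],
                fixed = TRUE)

# Hypothetical helper: one row per matched node, plus an authorid column
with_author_id <- function(doc, path, authorid) {
  df <- xmlToDataFrame(nodes = getNodeSet(doc, path), stringsAsFactors = FALSE)
  df$authorid <- authorid
  df
}

coredata   <- with_author_id(doc, "//coredata", authorid)
authorname <- with_author_id(doc, "//author-profile/preferred-name", authorid)

print(coredata)
print(authorname)
```

Each resulting data frame carries the same `authorid` column, which is what allows the tables to be joined later.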
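What kind of sorcery... This is an amazing answer, thank you for being so thorough. All the information is in one place, yet split into relational datasets, with each group of information separate but carrying the unique identifier. I will read through this carefully and learn as much as I can. Unless you have other thoughts, I will mark this as the answer soon.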
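See the update. XSLT can use [variables](http://www.w3schools.com/xsl/el_variable.asp), which can be passed to other parts of the document, and can even parse out 'Author:' with the [substring-after](http://zvon.org/xxl/XSLTreference/OutputOverview/function_substring-after_frame.html) function. So 'authorid' can be passed into the other related nodes. In fact, I just learned here that SAS can import multiple tables from one XML! I will certainly add this example to my library. As for R, just use 'xmlToDataFrame' on the node sets and 'xmlSApply()' for the authorids. Thanks for the question! – Parfait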
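This can be done easily with [XSLT](http://www.w3schools.com/xsl/), the language that restructures XML documents to fit any nuanced need. If [SAS](https://www.sas.com/en_us/home.html) refers to the statistical package, then we can use [proc xsl](http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a003356144.htm). Please tag this question with SAS and provide an actual sample of the XML document and the desired dataset result. – Parfait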
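[Here](https://dl.dropboxusercontent.com/u/8428744/example_file.xml) is a sample file. I have more than 11,000 of these files, which I merge into one large XML file with a program called mergex.exe. I then import the XML file into SAS using SAS's XML Mapper. It is very convenient, but the structure of the XML files makes it impossible to link the id to the author information. Ideally, every dataset generated in SAS would contain the author ID (which I extract from the XML file using 'as.numeric(sub("AUTHOR_ID:", "", xmlValue(xml["//dc:identifier"][[1]])))').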