R: xmlEventParse with large, varying-node XML input and conversion to a data frame

Question:

I have ~100 XML files of publication data, each > 10GB, formatted like this:

<?xml version="1.0" encoding="UTF-8"?> 
<records xmlns="http://website"> 
<REC rid="this is a test"> 
    <UID>ABCD123</UID> 
    <data_1> 
     <fullrecord_metadata> 
      <references count="3"> 
       <reference> 
        <uid>ABCD2345</uid> 
       </reference> 
       <reference> 
        <uid>ABCD3456</uid> 
       </reference> 
       <reference> 
        <uid>ABCD4567</uid> 
       </reference> 
      </references> 
     </fullrecord_metadata> 
    </data_1> 
</REC> 
<REC rid="this is a test"> 
    <UID>XYZ0987</UID> 
    <data_1> 
     <fullrecord_metadata> 
      <references count="N"> 
      </references> 
     </fullrecord_metadata> 
    </data_1> 
</REC> 
</records>

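The number of references varies for each unique entry (indexed by UID), and some entries may have none.

The goal: create one simple data.frame per XML file, as follows: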

UID  reference 
ABCD123 ABCD2345 
ABCD123 ABCD3456 
ABCD123 ABCD4567 
XYZ0987 NULL

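Because of the size of the files and the need to loop efficiently over many of them, I have been exploring xmlEventParse to limit memory usage. I can successfully extract the key unique "UID"s for each "REC" and create a data.frame using the following code, taken from an earlier question: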

library(XML)

branchFunction <- function() {
    store <- new.env()
    # branch handler: called with each complete <UID> node
    func <- function(x, ...) {
        ns <- getNodeSet(x, path = "//UID")
        key <- xmlValue(ns[[1]])
        value <- xmlValue(ns[[1]])
        print(value)
        store[[key]] <- value
    }
    # return everything collected so far as a list
    getStore <- function() { as.list(store) }
    list(UID = func, getStore = getStore)
}

myfunctions <- branchFunction() 

xmlEventParse(
    file = "test.xml", 
    handlers = NULL, 
    branches = myfunctions 
) 

DF <- do.call(rbind.data.frame, myfunctions$getStore())

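But I cannot successfully store the reference data, nor handle the varying number of references for a single UID. Thanks for any suggestions!

Answer

Set up a function that creates a temporary storage area for our element data, along with a handler function that is called each time a REC element is found.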

library(XML) 

uid_traverse <- function() { 

    # we'll store them as character vectors and then make a data frame out of them. 
    # this is likely one of the cheapest & fastest methods despite growing a vector 
    # inch by inch. You can pre-allocate space and modify this idiom accordingly 
    # for another speedup. 

    uids <- c() 
    refs <- c() 

    REC <- function(x) { 

        uid <- xpathSApply(x, "//UID", xmlValue) 
        ref <- xpathSApply(x, "//reference/uid", xmlValue) 

        if (length(uid) > 0) { 

            if (length(ref) == 0) { 

                uids <<- c(uids, uid) 
                refs <<- c(refs, NA_character_) 

            } else { 

                uids <<- c(uids, rep(uid, length(ref))) 
                refs <<- c(refs, ref) 

            } 

        } 

    } 

    # we return a named list with the element handler and another 
    # function that turns the vectors into a data frame 

    list(
        REC = REC, 
        uid_df = function() { 
            data.frame(uid = uids, ref = refs, stringsAsFactors = FALSE) 
        } 
    ) 

}

We need one instance of this function.

uid_f <- uid_traverse()

Now we call xmlEventParse() and pass it our handler. We wrap the call in invisible() because we don't need what xmlEventParse() returns, only its side effects:

invisible(
    xmlEventParse(
        file = path.expand("~/data/so.xml"), 
        branches = uid_f["REC"]
    ) 
)

And we can see the result:

uid_f$uid_df() 
##  uid  ref 
## 1 ABCD123 ABCD2345 
## 2 ABCD123 ABCD3456 
## 3 ABCD123 ABCD4567 
## 4 XYZ0987  <NA>
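
The comments in uid_traverse() mention that pre-allocating space could give another speedup. A minimal sketch of that idea follows; it is not part of the original answer. The name uid_traverse_prealloc and the initial_size parameter are assumptions made for illustration, and the buffers are doubled whenever a record would not fit.

```r
library(XML)

# Hypothetical variant of uid_traverse() with pre-allocated buffers.
# `initial_size` is an assumed tuning parameter, not from the original answer.
uid_traverse_prealloc <- function(initial_size = 1e6) {

    uids <- character(initial_size)
    refs <- character(initial_size)
    n <- 0L  # number of rows filled so far

    REC <- function(x) {

        uid <- xpathSApply(x, "//UID", xmlValue)
        ref <- xpathSApply(x, "//reference/uid", xmlValue)

        if (length(uid) == 0) return(invisible(NULL))

        # a record without references still gets one row, with NA as the ref
        if (length(ref) == 0) ref <- NA_character_
        k <- length(ref)

        # grow both buffers (at least doubling) until the record fits
        while (n + k > length(uids)) {
            extra <- max(length(uids), k)
            uids <<- c(uids, character(extra))
            refs <<- c(refs, character(extra))
        }

        idx <- n + seq_len(k)
        uids[idx] <<- uid[1]
        refs[idx] <<- ref
        n <<- n + k

        invisible(NULL)
    }

    list(
        REC = REC,
        # only the filled part of the buffers ends up in the data frame
        uid_df = function() {
            data.frame(
                uid = uids[seq_len(n)],
                ref = refs[seq_len(n)],
                stringsAsFactors = FALSE
            )
        }
    )
}
```

It is used the same way as uid_traverse(): create an instance, pass its "REC" element as branches to xmlEventParse(), then call uid_df(). Whether it is noticeably faster on the real files would need to be measured.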

Awesome. This is very efficient and also easy to modify for extracting other parts of the data. Thanks hrbrmstr! – km5041

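I'm running into memory limits with the full-size files, and I tried to modify the code to write the data frame out intermittently with write.csv to keep memory in check. However, I can't figure out how to clear the vectors on the fly. I tried testing the length of "uids", writing the data frame to a csv once it exceeds a certain limit, and then NULLing "uids" and "vars", yet something keeps being held in memory as the function progresses. I can't figure out what... any ideas? – km5041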
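
One way the intermittent write described in this comment could be structured is sketched below. It is an illustration only, not code from the thread, and it is not claimed to resolve the memory retention reported here. The names uid_traverse_flush, out_file, chunk_size and flush_rows are all assumptions. It uses write.table() with sep = "," rather than write.csv(), because write.csv() ignores the append argument.

```r
library(XML)

# Hypothetical handler factory that flushes rows to a CSV file in chunks.
# `out_file` and `chunk_size` are assumed names, not from the original thread.
uid_traverse_flush <- function(out_file, chunk_size = 100000L) {

    uids <- character(0)
    refs <- character(0)
    first_write <- TRUE

    # write the buffered rows to disk and reset the buffers
    flush_rows <- function() {
        if (length(uids) == 0) return(invisible(NULL))
        write.table(
            data.frame(uid = uids, ref = refs, stringsAsFactors = FALSE),
            file = out_file,
            sep = ",",
            row.names = FALSE,
            col.names = first_write,  # header only on the first chunk
            append = !first_write     # the first chunk overwrites the file
        )
        first_write <<- FALSE
        uids <<- character(0)
        refs <<- character(0)
        invisible(NULL)
    }

    REC <- function(x) {
        uid <- xpathSApply(x, "//UID", xmlValue)
        ref <- xpathSApply(x, "//reference/uid", xmlValue)

        if (length(uid) > 0) {
            if (length(ref) == 0) ref <- NA_character_
            uids <<- c(uids, rep(uid[1], length(ref)))
            refs <<- c(refs, ref)
        }

        # flush once the buffer reaches the chunk size
        if (length(uids) >= chunk_size) flush_rows()
        invisible(NULL)
    }

    list(REC = REC, flush_rows = flush_rows)
}

# Usage sketch (the output file name is a placeholder):
# h <- uid_traverse_flush("so_refs.csv")
# invisible(xmlEventParse(file = path.expand("~/data/so.xml"), branches = h["REC"]))
# h$flush_rows()  # write whatever is still buffered after parsing ends
```

Resetting the buffers with <<- inside the closure replaces the vectors the handler actually appends to, and the final flush_rows() call is needed because the last chunk is usually smaller than chunk_size.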

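More on the memory issue: even if I stop the function midway, remove all objects (including hidden ones) and run garbage collection, the memory still shows as being used by R. It is not released until R exits. Does this process create an object that even "rm" can't delete? – km5041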
