Extracting data from nested lists of differing lengths into a `data.frame` with `purrr`

Problem description:

This is a direct follow-up to a similar question I asked earlier about extracting a specific subset of a list of lists: Extracting data from a list of lists into its own `data.frame` with `purrr`

I will therefore use the same sample dataset:

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
        d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr", 
             score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", "label", "link", 
"score", "externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e"))) 

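The general problem I'm trying to solve is extracting the contents of nested lists that vary in length, and binding them to other content in the same list that essentially serves as an ID for the nested content.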

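In the context of the sample dataset above, I'm trying to extract the contents of sub-list d into a data.table/data.frame, while also extracting (and essentially repeating) the value of a for each element, so that I can tell which rows extracted from d belong to the same element even though the d sub-lists differ in length. An example of the desired data.table explains it best: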

         a id          label                 link      score externalId
-1.5467647  5           Utah          Asia/Anadyr -0.2110459         NA
-0.9348211  8 South Carolina       Pacific/Wallis  0.5265409  -6.743544
-0.9348211  9       Nebraska America/Scoresbysund  0.2508955   16.42575

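Note that the first column, a, holds the value of a from each first-level sub-list of l. The first row is the content of d in the first item (length 1); the second and third rows are the contents of d in the second item (length 2), so both rows carry the same value of a, -0.9348211.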

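My current solution works in a roundabout way and is error-prone. Given how closely this relates to the post referenced above, I wonder whether I'm simply failing to see how to extend that solution to this related problem.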

Every nested list tends to need a slightly different approach, but this one covers some typical cases:

library(tidyverse) 

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
        d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr", 
             score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", "label", "link", 
"score", "externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e"))) 

l %>% 
    map(set_names, letters[1:5]) %>% # add missing names 
    map(modify_at, 'd', bind_rows) %>% # coerce nested elements to data.frame 
    # make each element to a data.frame, and rbind them all together 
    map_df(data.frame, stringsAsFactors = FALSE) 
#>             a b c d.id        d.label               d.link    d.score         e d.externalId
#>  1 -1.5467647 s T    5           Utah          Asia/Anadyr -0.2110459  49.12799           NA
#>  2 -0.9348211 k T    8 South Carolina       Pacific/Wallis  0.5265409  52.31614    -6.743544
#>  3 -0.9348211 k T    9       Nebraska America/Scoresbysund  0.2508955  52.31614    16.425747
#>  4 -0.2726149 f P    8        Georgia         America/Nome  0.5264941 132.11535     7.915836
#>  5 -0.2726149 f P    2     Washington     America/Shiprock -0.5551864 132.11535    15.068666
#>  6 -0.2726149 f P    6   North Dakota            Universal  1.0316830 132.11535           NA
#>  7 -0.2726149 f P    1  New Hampshire      America/Cordoba  1.2158206 132.11535     9.727642
#>  8 -0.2726149 f P    1         Alaska        Asia/Istanbul -0.2318326 132.11535           NA
#>  9 -0.2726149 f P    4   Pennsylvania Africa/Dar_es_Salaam  0.5902453 132.11535           NA
#> 10  0.2026860 x O    3       Delaware       Asia/Samarkand  0.6955771  97.99089    15.236482
#> 11 -0.3962434 z P    4   North Dakota      America/Tortola  1.0306027 123.59795    -7.216669
#> 12 -0.3962434 z P    9       Nebraska      America/Ojinaga -1.1139800 123.59795    -8.451451

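There are many other ways to do this, but the key is to start by converting the innermost nested elements to an appropriate data structure, then combine them with the remaining elements until you end up with a data.frame.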
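As one illustration of that "innermost first" idea, here is a minimal sketch of an alternative that keeps only `a` and the fields of `d`. It assumes `dplyr` and `purrr` are installed; the `toy` list and the `flatten_one` helper are hypothetical stand-ins invented for this example, not part of the original data or answer.

```r
library(purrr)
library(dplyr)

# Hypothetical miniature of `l`: `d` holds either one record or a list of records
toy <- list(
  list(a = 1.5, b = "s", d = list(id = 5L, label = "Utah")),
  list(a = 2.5, b = "k", d = list(
    list(id = 8L, label = "South Carolina", externalId = -6.7),
    list(id = 9L, label = "Nebraska", externalId = 16.4)
  ))
)

# Hypothetical helper: one element becomes a data frame of `a` plus the fields of `d`
flatten_one <- function(x) {
  d_rows <- bind_rows(x$d)              # innermost first: one row per record in `d`
  mutate(d_rows, a = x$a, .before = 1)  # the length-1 `a` is repeated on every row
}

# Combine the per-element data frames; fields missing from a record become NA
out <- bind_rows(map(toy, flatten_one))
out
```

Because each element is reduced to `a` and `d` before anything is combined, the other elements (`b` here) never take part in the binding step.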

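Note that using data.frame rather than the tibble equivalent is a little awkward here, but data.frame is much better at taking data frames and plain values into a single data frame and recycling where necessary. Using the tidyverse versions requires making everything the correct length up front rather than relying on recycling.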
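The recycling behaviour of `data.frame()` can be seen in isolation with a small base R sketch. The values below are made up for illustration; only `data.frame()` and `tryCatch()` from base R are used.

```r
# A length-1 value next to a two-row data frame: data.frame() repeats the value
recycled <- data.frame(
  a = -0.93,
  d = data.frame(id = 8:9, label = c("South Carolina", "Nebraska")),
  stringsAsFactors = FALSE
)
names(recycled)  # nested columns are prefixed with the argument name: d.id, d.label
nrow(recycled)   # 2

# Lengths that are not multiples of each other cannot be recycled,
# so data.frame() stops with an error; capture its message instead of failing
err <- tryCatch(
  data.frame(x = 1:2, y = 1:7),
  error = function(e) conditionMessage(e)
)
err
```

The second call shows the kind of situation in which "arguments imply differing number of rows" is raised: two or more inputs have lengths greater than one that do not match.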

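The above is really helpful, but when I apply the solution to my actual data it still doesn't fully work. My actual data differs from the sample data above, and I get errors such as "arguments imply differing number of rows: 1, 2, 7". I think this may be caused by parts of the list I'm not interested in. For example, if I wanted to exclude 'b', 'c' and 'e', how could I use the approach above to focus on just 'a' and 'd'? – daRknight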

`purrr::keep` is an option – alistaire

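I looked up the help for `purrr::keep`, and it appears to be driven by a logical vector or a predicate. How would I use it with a function based on 'a' being numeric, or, say, 'b' being character? – daRknight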
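For reference, a minimal sketch of how `purrr::keep` behaves on one element, alongside plain subsetting by name. The element `x` below is a hypothetical one shaped like an entry of `l`; this is one possible reading of the suggestion, not necessarily what the commenter had in mind.

```r
library(purrr)

# Hypothetical element shaped like one entry of `l`
x <- list(a = -0.93, b = "k", c = "T",
          d = list(id = 8L, label = "South Carolina"), e = 52.3)

# keep() applies the predicate to each element's value, not to its name
numeric_parts   <- keep(x, is.numeric)    # keeps `a` and `e`
character_parts <- keep(x, is.character)  # keeps `b` and `c`

# Selecting by name needs no predicate: ordinary subsetting is enough
a_and_d <- x[c("a", "d")]

# The same selection applied to every element of a list of such elements
selected <- map(list(x, x), `[`, c("a", "d"))
```

Since a type predicate cannot tell `a` apart from `e` (both are numeric), subsetting by name is probably the more direct route when the goal is to keep exactly `a` and `d`.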