Subset a data frame to include only levels of one factor that have values in both levels of another factor

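Problem description:

I am working with a data frame of digit measurements. Some individuals have been measured multiple times, as both juveniles and adults. A reproducible example: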

ID <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3") 
age <- rep(c("juvenile", "adult"), each=5) 
size <- rnorm(10) 

# e.g. a1 is measured 3 times, twice as a juvenile, once as an adult. 
d <- data.frame(ID, age, size) 

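My goal is to subset the data frame by selecting the IDs that appear at least once as a juvenile and at least once as an adult. I am not sure how to do this.

The resulting data frame would contain all measurements for individuals a1, a2 and a3, but would exclude a4, a5 and a6, since they were not measured in both stages.

A similar question was asked 7 months ago but never received an answer (Subset data frame to include only levels one factor that have values in both levels of another factor).

Thanks!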

Here is an option with data.table:

library(data.table)
setDT(d)[, .SD[all(c("juvenile", "adult") %in% age)], ID]

Or a base R option with ave:

d[with(d, ave(as.character(age), ID, FUN = function(x) length(unique(x)))>1),] 
# ID  age  size 
#1 a1 juvenile -1.4545407 
#2 a2 juvenile -0.4695317 
#3 a3 juvenile 0.2271316 
#5 a1 juvenile 0.2961210 
#6 a2 adult -0.8331993 
#9 a1 adult -0.6924967 
#10 a3 adult -0.4619550 
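
A side note on the ave call: since it is handed a character vector, ave appears to return the per-ID counts as strings, so the comparison with 1 works only through coercion. Below is a minimal sketch of a variant that keeps the counts numeric. It assumes the same d as in the question; set.seed and the names n_stages and both are added here for illustration and are not part of the original answer.

```r
# Same example data as in the question; set.seed is added only for reproducibility.
set.seed(1)
ID <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each = 5)
size <- rnorm(10)
d <- data.frame(ID, age, size)

# Count distinct age classes per ID on an integer vector,
# so ave() hands back numbers rather than strings.
n_stages <- ave(as.integer(factor(d$age)), d$ID,
                FUN = function(x) length(unique(x)))

# Keep only rows whose ID was observed in both age classes.
both <- d[n_stages == 2, ]
print(both)
```

Comparing against 2 states the "both stages" condition explicitly; with more than two age classes the threshold would need to change accordingly.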

With dplyr, you can use group_by %>% filter:

library(dplyr) 
d %>% group_by(ID) %>% filter(all(c("juvenile", "adult") %in% age)) 

# A tibble: 7 x 3 
# Groups: ID [3] 
#  ID  age  size 
# <fctr> <fctr>  <dbl> 
#1  a1 juvenile -0.6947697 
#2  a2 juvenile -0.3665272 
#3  a3 juvenile 1.0293555 
#4  a1 juvenile 0.2745224 
#5  a2 adult 0.5299029 
#6  a1 adult 2.2247802 
#7  a3 adult -0.4717160 

split by age, intersect, and subset:

d[d$ID %in% Reduce(intersect, split(d$ID, d$age)),] 
# ID  age  size 
#1 a1 juvenile 1.44761836 
#2 a2 juvenile 1.70098645 
#3 a3 juvenile 0.08231986 
#5 a1 juvenile 0.91240568 
#6 a2 adult -1.77318962 
#9 a1 adult 0.13597986 
#10 a3 adult -1.18575294
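
To make the one-liner easier to follow, here is a sketch that spells out its intermediate steps. It assumes the same d as in the question; set.seed and the names ids_by_age, common_ids and result are illustrative additions, not part of the original answer.

```r
# Same example data as in the question; set.seed is added only for reproducibility.
set.seed(1)
ID <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each = 5)
size <- rnorm(10)
d <- data.frame(ID, age, size)

# Step 1: one vector of IDs per age class (a named list).
ids_by_age <- split(as.character(d$ID), d$age)

# Step 2: IDs present in every element of the list.
# Reduce() applies intersect() pairwise, so this also covers more than two classes.
common_ids <- Reduce(intersect, ids_by_age)

# Step 3: keep all rows belonging to those IDs.
result <- d[d$ID %in% common_ids, ]
print(result)
```

Because Reduce folds intersect over the whole list, the same code should carry over unchanged if a third age class were added.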