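Subset data frame to include only levels of one factor that have values in both levels of another factor
Question:
I am working with a data frame of numeric measurements. Some individuals have been measured more than once, both as juveniles and as adults. A reproducible example: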
ID <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each=5)
size <- rnorm(10)
# e.g. a1 is measured 3 times, twice as a juvenile, once as an adult.
d <- data.frame(ID, age, size)
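My goal is to subset the data frame by selecting the IDs that appear at least once as a juvenile and at least once as an adult. I'm not sure how to do this.
The resulting data frame would contain all measurements for individuals a1, a2 and a3, but would exclude a4, a5 and a6, because they were not measured at both stages.
A similar question was asked 7 months ago but never received an answer (Subset data frame to include only levels one factor that have values in both levels of another factor).
Thanks!
Answer
Here is a data.table option: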
library(data.table)
setDT(d)[, .SD[all(c("juvenile", "adult") %in% age)], ID]
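The `.SD[...]` call above subsets each group with a single TRUE/FALSE value, so a group is returned either whole or not at all. A minimal sketch of the same idea written with `if`, which some readers may find easier to follow; the names `d2`, `stages` and `both` are hypothetical and not part of the original answer:

```r
library(data.table)

# Rebuild the example data (sizes are random; only ID and age matter here).
ID  <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each = 5)
d2  <- data.table(ID = ID, age = age, size = rnorm(10))  # hypothetical copy of d

stages <- c("juvenile", "adult")

# For each ID, return the group's rows only if both stages are present;
# groups for which the `if` yields NULL are dropped from the result.
both <- d2[, if (all(stages %in% age)) .SD, by = ID]
print(both)
```

The result is grouped by ID, so the rows come back ordered by group rather than in their original order.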
Or a base R option with ave:
d[with(d, ave(as.character(age), ID, FUN = function(x) length(unique(x)))>1),]
#    ID      age       size
#1   a1 juvenile -1.4545407
#2   a2 juvenile -0.4695317
#3   a3 juvenile  0.2271316
#5   a1 juvenile  0.2961210
#6   a2    adult -0.8331993
#9   a1    adult -0.6924967
#10  a3    adult -0.4619550
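One detail worth knowing about the `ave` call: because its first argument is a character vector, `ave` returns character values, so `> 1` is a string comparison ("2" > "1"). That gives the intended result here. It also tests for "more than one distinct stage" rather than for juvenile and adult specifically, which is equivalent only when there are exactly two stages. A sketch of a variant that keeps the count numeric, assuming the same example data; `d2`, `age_code`, `n_stages` and `both` are hypothetical names:

```r
# Rebuild the example data (sizes are random; only ID and age matter here).
ID  <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each = 5)
d2  <- data.frame(ID, age, size = rnorm(10))  # hypothetical copy of d

# Integer codes for the stages, so that ave() returns numbers, not strings.
age_code <- as.integer(factor(d2$age))

# Number of distinct stages observed for each row's ID.
n_stages <- ave(age_code, d2$ID, FUN = function(x) length(unique(x)))

# Keep rows whose ID was seen in both stages.
both <- d2[n_stages == 2, ]
print(both)
```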
Answer
With dplyr, you can use group_by %>% filter:
library(dplyr)
d %>% group_by(ID) %>% filter(all(c("juvenile", "adult") %in% age))
# A tibble: 7 x 3
# Groups:   ID [3]
#      ID      age       size
#  <fctr>   <fctr>      <dbl>
#1     a1 juvenile -0.6947697
#2     a2 juvenile -0.3665272
#3     a3 juvenile  1.0293555
#4     a1 juvenile  0.2745224
#5     a2    adult  0.5299029
#6     a1    adult  2.2247802
#7     a3    adult -0.4717160
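As the "Groups: ID [3]" line in the output shows, the filtered result is still grouped by ID. If later steps should treat it as an ordinary data frame, one possible follow-up is to drop the grouping with `ungroup()`. A minimal sketch, assuming the same example data; `d2` and `both` are hypothetical names:

```r
library(dplyr)

# Rebuild the example data (sizes are random; only ID and age matter here).
ID  <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each = 5)
d2  <- data.frame(ID, age, size = rnorm(10))  # hypothetical copy of d

both <- d2 %>%
  group_by(ID) %>%
  # Keep a whole group only if both stages occur within it.
  filter(all(c("juvenile", "adult") %in% age)) %>%
  # Remove the grouping so later verbs operate on all rows at once.
  ungroup()

print(both)
```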
Answer
split by age, intersect, and subset:
d[d$ID %in% Reduce(intersect, split(d$ID, d$age)),]
#    ID      age        size
#1   a1 juvenile  1.44761836
#2   a2 juvenile  1.70098645
#3   a3 juvenile  0.08231986
#5   a1 juvenile  0.91240568
#6   a2    adult -1.77318962
#9   a1    adult  0.13597986
#10  a3    adult -1.18575294
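To make the one-liner easier to follow, here is a sketch that spells out its intermediate steps on the same example data. The names `d2`, `ids_by_stage`, `keep` and `both` are hypothetical, and `as.character()` is added only so the sketch behaves the same whether ID is stored as a factor or as character:

```r
# Rebuild the example data (sizes are random; only ID and age matter here).
ID  <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each = 5)
d2  <- data.frame(ID, age, size = rnorm(10))  # hypothetical copy of d

# Step 1: one vector of IDs per stage (a list with elements adult, juvenile).
ids_by_stage <- split(as.character(d2$ID), d2$age)
print(ids_by_stage)

# Step 2: IDs present in every stage; Reduce() folds intersect() over the
# list, so this would also work with more than two stages.
keep <- Reduce(intersect, ids_by_stage)

# Step 3: keep all rows belonging to those IDs.
both <- d2[d2$ID %in% keep, ]
print(both)
```

Because `Reduce(intersect, ...)` requires an ID to appear in every element of the list, this approach selects IDs measured at all stages, however many stages there are.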