Get counts of specific values from another data frame

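Problem description:

This question may sound similar to others, but I hope it is different enough. I want to take a specific list of values and count how often they appear in another list of values, with values that do not appear reported as 0.

I have a data frame (df1) containing a column named 'Items', with the following values: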

Items <- c('Carrots','Plums','Pineapple','Turkey') 
df1<-data.frame(Items) 

>df1 
Items 
1 Carrots 
2  Plums 
3 Pineapple 
4 Turkey

And a second data frame (df2):

> head(df2,n=10) 
    ID  Date  Thing 
1 58150 2012-09-12 Potatoes 
2 12357 2012-09-28 Turnips 
3 50788 2012-10-04 Oranges 
4 66038 2012-10-11 Potatoes 
5 18119 2012-10-11 Oranges 
6 48349 2012-10-14 Carrots 
7 23328 2012-10-16 Peppers 
8 66038 2012-10-26 Pineapple 
9 32717 2012-10-28 Turnips 
10 11345 2012-11-08 Oranges

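I know the word "Turkey" appears only in df1 and not in df2. I want to return a frequency table or count of the items in df1 that appear in df2, and return a count of 0 for Turkey.

How can I use a column from one data frame to summarize the values in another? The closest I have gotten is: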

df2%>% count (Thing) %>% filter(Thing %in% df1$Items,)

But this returns only the items that df1 and df2 have in common, so "Turkey" is excluded. So close!

> df2%>% count (Thing) %>% filter(Thing %in% df1$Items,) 
# A tibble: 3 x 2 
     Thing  n 
    <fctr> <int> 
1 Carrots 30 
2 Pineapple 30 
3  Plums 38

I would like my output to look like this:

1 Carrots 30 
2 Pineapple 30 
3  Plums 38 
4 Turkey  0

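I am new to R and completely new to dplyr.

Answer

I do this sort of thing all the time. I am sure there is a smarter way to code it, but this is what I have: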

item <- vector() 
count <- vector() 
items <- list(unique(df1$Items)) 

for (i in 1:length(items)){ 
    item[i] <- items[i] 
    count[i] <- sum(df2$Thing == item) 
} 

df3 <- data.frame(cbind(item, count))

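Hope this helps!

Thanks Stephen, I got a length warning: 'longer object length is not a multiple of shorter object length' – gzrcm

Ah, I think I know why. The code above looks at every item, not just the unique items. I have updated my comment. –

I still get the same error, but I see what your script is trying to achieve. The df1 I created comes from a vector. Is there any way to simplify the for loop by using the original vector? – gzrcm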

Answer

Stephen's solution, slightly modified by adding [i] at the end of the count[i] line. See below:

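item <- vector() 
count <- vector() 

# Loop over the unique values of the original Items vector
for (i in seq_along(unique(Items))){ 
    item[i] <- unique(Items)[i] 
    count[i] <- sum(df2$Thing == item[i]) 
} 

# data.frame() keeps count numeric; cbind() would coerce both columns to character
df3 <- data.frame(item, count) 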

> df3 
     item count 
1 Carrots 30 
2  Plums 38 
3 Pineapple 30 
4 Turkey  0

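Answer

dplyr drops rows with a count of 0, and you have the added complication that the possible categories of Thing differ between your two datasets.

If you add the factor levels from df1 to df2, you can use complete from tidyr, which is a common way to add zero-count rows.

I am using a handy function from the forcats package called fct_expand to add the factor levels from df1 to df2.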

library(dplyr) 
library(tidyr) 
library(forcats) 

df2 %>% 
    mutate(Thing = fct_expand(Thing, as.character(df1$Items))) %>% 
    count(Thing) %>% 
    complete(Thing, fill = list(n = 0)) %>% 
    filter(Thing %in% df1$Items)
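
The same idea of keeping unused factor levels can also be sketched in base R without extra packages. This is only an illustration under assumptions: the small df1 and df2 below are stand-in data modelled on the head(df2) output in the question, not the asker's full data, and the column name n is chosen to match the dplyr output.

```r
# Stand-in data (hypothetical, modelled on the head(df2) output in the question)
df1 <- data.frame(Items = c("Carrots", "Plums", "Pineapple", "Turkey"))
df2 <- data.frame(Thing = c("Potatoes", "Turnips", "Oranges", "Potatoes", "Oranges",
                            "Carrots", "Peppers", "Pineapple", "Turnips", "Oranges"))

# Restrict the levels to the items of interest: values outside the levels
# become NA and are ignored by table(), while unused levels keep a count of 0
wanted <- unique(as.character(df1$Items))
counts <- table(factor(as.character(df2$Thing), levels = wanted))

# Convert the table to a data frame with one row per item
df3 <- as.data.frame(counts, responseName = "n", stringsAsFactors = FALSE)
names(df3)[1] <- "Thing"
df3
```

Because the levels are set explicitly, the result always has one row per item in df1, in the order given there, so no separate filter step is needed.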

Thanks aosmith! This works too. – gzrcm

Answer

A different approach is to aggregate df2 first, right join with df1 (to pick all rows of df1), and replace NA with zero.

library(dplyr) 
df2 %>% 
    count(Thing) %>% 
    right_join(unique(df1), by = c("Thing" = "Items")) %>% 
    mutate(n = coalesce(n, 0L))

# A tibble: 4 x 2 
     Thing  n 
     <chr> <int> 
1 Carrots  1 
2  Plums  0 
3 Pineapple  1 
4 Turkey  0 
Warning message: 
Column `Thing`/`Items` joining factors with different levels, coercing to character vector

The same approach in data.table:

library(data.table) 
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][is.na(N), N := 0L][]

 Thing N 
1: Carrots 1 
2:  Plums 0 
3: Pineapple 1 
4: Turkey 0

Note that in both implementations unique(df1) is used to avoid unintended duplicate rows after the join.
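
To see what unique(df1) guards against, here is a minimal sketch with made-up data in which "Carrots" is listed twice in df1. The data and the names with_dups and without_dups are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical data: "Carrots" appears twice in df1
df1 <- data.frame(Items = c("Carrots", "Carrots", "Turkey"), stringsAsFactors = FALSE)
df2 <- data.frame(Thing = c("Carrots", "Oranges"), stringsAsFactors = FALSE)

counts <- count(df2, Thing)

# Without unique(): every duplicate in df1 produces its own output row
with_dups <- right_join(counts, df1, by = c("Thing" = "Items"))

# With unique(): one output row per distinct item
without_dups <- right_join(counts, unique(df1), by = c("Thing" = "Items"))

nrow(with_dups)
nrow(without_dups)
```

Here the join without unique() returns three rows (Carrots twice), while the deduplicated join returns two.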

If df2 is large and df1 contains only a few Items, it may be more efficient to join first and then aggregate:

library(dplyr) 
df2 %>% 
    right_join(unique(df1), by = c("Thing" = "Items")) %>% 
    group_by(Thing) %>% 
    summarise(n = sum(!is.na(ID)))

# A tibble: 4 x 2 
     Thing  n 
     <chr> <int> 
1 Carrots  1 
2 Pineapple  1 
3  Plums  0 
4 Turkey  0 
Warning message: 
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
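
The join-then-aggregate variant counts with sum(!is.na(ID)) rather than n(). A likely reason is that the right join keeps one placeholder row, with NA in the df2 columns, for every item missing from df2, and n() would count that row. The sketch below uses small made-up data to show the difference; the column name rows is an assumption added for comparison.

```r
library(dplyr)

# Hypothetical stand-in data
df1 <- data.frame(Items = c("Carrots", "Turkey"), stringsAsFactors = FALSE)
df2 <- data.frame(ID = c(48349, 50788), Thing = c("Carrots", "Oranges"),
                  stringsAsFactors = FALSE)

# "Turkey" has no match in df2, so it gets a single row with ID = NA
joined <- right_join(df2, df1, by = c("Thing" = "Items"))

result <- joined %>%
    group_by(Thing) %>%
    summarise(rows = n(),            # also counts the NA placeholder row
              n = sum(!is.na(ID)))   # counts only rows that came from df2
result
```

For "Turkey", rows is 1 but n is 0, which is the count the question asks for.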

Likewise in data.table syntax:

library(data.table) 
setDT(df2)[unique(setDT(df1)), on = .(Thing = Items)][, .(N = sum(!is.na(ID))), by = Thing][]

 Thing N 
1: Carrots 1 
2:  Plums 0 
3: Pineapple 1 
4: Turkey 0

Thanks Uwe! Your solution works – gzrcm
