Subsetting a data frame in R

Problem description:

I have two data frames, DF and df2:

> DF
        date tickers
1 2000-01-01       B
2 2000-01-01    GOOG
3 2000-01-01       V
4 2000-01-01    YHOO
5 2000-01-02     XOM

> df2
        date tickers quantities
1 2000-01-01      BB         11
2 2000-01-01     XOM         23
3 2000-01-01    GOOG         42
4 2000-01-01    YHOO         21
5 2000-01-01       V       2112
6 2000-01-01       B         13
7 2000-01-02     XOM         24
8 2000-01-02      BB        422

I need the rows of df2 whose date and tickers values both appear together in a row of DF. That means I need the following output:

3 2000-01-01    GOOG         42
4 2000-01-01    YHOO         21
5 2000-01-01       V       2112
6 2000-01-01       B         13
7 2000-01-02     XOM         24

So I used the following code:

> subset(df2, df2$date %in% DF$date & df2$tickers %in% DF$tickers)
        date tickers quantities
2 2000-01-01     XOM         23
3 2000-01-01    GOOG         42
4 2000-01-01    YHOO         21
5 2000-01-01       V       2112
6 2000-01-01       B         13
7 2000-01-02     XOM         24

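But the output contains an extra row (row 2). That is because the ticker "XOM" appears on both days in df2, and the two %in% conditions are checked independently of each other, so both XOM rows are selected. What modifications does my code need?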

The dput output is as follows:

> dput(DF) 
structure(list(date = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("2000-01-01", 
"2000-01-02"), class = "factor"), tickers = structure(c(4L, 5L, 
6L, 8L, 7L), .Label = c("A", "AA", "AAPL", "B", "GOOG", "V", 
"XOM", "YHOO", "Z"), class = "factor")), .Names = c("date", "tickers" 
), row.names = c(NA, -5L), class = "data.frame") 
> dput(df2) 
structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L), .Label = c("2000-01-01", "2000-01-02"), class = "factor"), 
    tickers = structure(c(2L, 5L, 3L, 6L, 4L, 1L, 5L, 2L), .Label = c("B", 
    "BB", "GOOG", "V", "XOM", "YHOO"), class = "factor"), quantities = c(11, 
    23, 42, 21, 2112, 13, 24, 422)), .Names = c("date", "tickers", 
"quantities"), row.names = c(NA, -8L), class = "data.frame") 
What do you want to do with the duplicated rows? Take just one, sum them, return the values as separate columns...? – Thomas 2013-05-06 12:14:22

Are you just looking for 'merge(DF,df2)'...? It gives the same result as the 'sqldf' answer below... – 2013-05-06 12:24:24

I thought merge() only worked on data frames with the same number of columns. That is why I asked this question. Thanks for your help. – 2013-05-06 12:30:37

This is actually not that different from my answer to this post of yours, but it needs a slight modification:

df2[duplicated(rbind(DF, df2[,1:2]))[-seq_len(nrow(DF))], ] 

#         date tickers quantities
# 3 2000-01-01    GOOG         42
# 4 2000-01-01    YHOO         21
# 5 2000-01-01       V       2112
# 6 2000-01-01       B         13
# 7 2000-01-02     XOM         24

Note: this returns the rows in the same order as they appear in df2.
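If you would rather stay close to the original subset() call, one possible sketch is to combine date and tickers into a single key per row and test that key with %in%. This is not the method used in the answer above, just an illustration of why the pair has to be matched together rather than column by column. The data below is retyped from the question; the names key_DF, key_df2 and result, and the "\r" separator, are assumptions made for this example.

```r
# Example data retyped from the question.
DF <- data.frame(
  date    = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02"),
  tickers = c("B", "GOOG", "V", "YHOO", "XOM")
)
df2 <- data.frame(
  date       = c(rep("2000-01-01", 6), rep("2000-01-02", 2)),
  tickers    = c("BB", "XOM", "GOOG", "YHOO", "V", "B", "XOM", "BB"),
  quantities = c(11, 23, 42, 21, 2112, 13, 24, 422)
)

# Build one key per row so that date and ticker are matched as a pair.
# The "\r" separator is an arbitrary choice, assumed not to occur in the data.
key_DF  <- paste(DF$date,  DF$tickers,  sep = "\r")
key_df2 <- paste(df2$date, df2$tickers, sep = "\r")

# Keep only the df2 rows whose (date, tickers) pair exists in DF.
result <- df2[key_df2 %in% key_DF, ]
print(result)
```

Because the filter is a logical index into df2, the selected rows keep df2's order and row names.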


Alternatively, as suggested in the comments, using merge:

merge(df2, DF, by=c("date", "tickers")) 

will give the same result as well (but not necessarily in the same order).
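As a small sketch of the merge approach: merge() does not require the two data frames to have the same number of columns, only the key columns named in by. If df2's original row order matters, one option is to carry a helper index column through the merge and sort on it afterwards. The row_id column and the name matched are hypothetical, introduced only for this example; the data is retyped from the question.

```r
# Example data retyped from the question.
DF <- data.frame(
  date    = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02"),
  tickers = c("B", "GOOG", "V", "YHOO", "XOM")
)
df2 <- data.frame(
  date       = c(rep("2000-01-01", 6), rep("2000-01-02", 2)),
  tickers    = c("BB", "XOM", "GOOG", "YHOO", "V", "B", "XOM", "BB"),
  quantities = c(11, 23, 42, 21, 2112, 13, 24, 422)
)

# Hypothetical helper column that remembers each row's position in df2.
df2$row_id <- seq_len(nrow(df2))

# Inner join on both key columns; DF has 2 columns, df2 has 4.
matched <- merge(df2, DF, by = c("date", "tickers"))

# merge() may reorder rows, so restore df2's original order.
matched <- matched[order(matched$row_id), ]
print(matched)
```

The helper column can be dropped afterwards with matched$row_id <- NULL if it is not needed.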

Using the sqldf package:

require(sqldf) 

sqldf("SELECT d2.date, d2.tickers, d2.quantities FROM df2 d2 
     JOIN DF d1 ON d1.date=d2.date AND d1.tickers=d2.tickers") 

##         date tickers quantities
## 1 2000-01-01    GOOG         42
## 2 2000-01-01    YHOO         21
## 3 2000-01-01       V       2112
## 4 2000-01-01       B         13
## 5 2000-01-02     XOM         24