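Index of the next occurring record
I have a sample dataset of the trajectory of one bike. My goal is to work out the average amount of time that elapses between visits to station B.
So far, I have been able to simply order the dataset: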
test[order(test$starttime, decreasing = FALSE),]
and find the row indices where start_station and end_station equal B:
which(test$start_station == 'B')
which(test$end_station == 'B')
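The next part is where I run into trouble. To calculate the time that elapses while the bike is away from station B, we must take the difftime() between each record where start_station = "B" (the bike leaves) and the next occurring record where end_station = "B", even if that record happens to be the same row (see row 6).
Using the dataset below, we know that the bike spent 510 minutes outside station B between 7:30:00 and 16:00:00, 30 minutes between 18:00:00 and 18:30:00, and 210 minutes between 19:00:00 and 22:30:00, which averages to 250 minutes.
How can this output be reproduced in R using difftime()?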
> test
bikeid start_station starttime end_station endtime
1 1 A 2017-09-25 01:00:00 B 2017-09-25 01:30:00
2 1 B 2017-09-25 07:30:00 C 2017-09-25 08:00:00
3 1 C 2017-09-25 10:00:00 A 2017-09-25 10:30:00
4 1 A 2017-09-25 13:00:00 C 2017-09-25 13:30:00
5 1 C 2017-09-25 15:30:00 B 2017-09-25 16:00:00
6 1 B 2017-09-25 18:00:00 B 2017-09-25 18:30:00
7 1 B 2017-09-25 19:00:00 A 2017-09-25 19:30:00
8 1 A 2017-09-25 20:00:00 C 2017-09-25 20:30:00
9 1 C 2017-09-25 22:00:00 B 2017-09-25 22:30:00
10 1 B 2017-09-25 23:00:00 C 2017-09-25 23:30:00
Here is the sample data:
> dput(test)
structure(list(bikeid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start_station = c("A",
"B", "C", "A", "C", "B", "B", "A", "C", "B"), starttime = structure(c(1506315600,
1506339000, 1506348000, 1506358800, 1506367800, 1506376800, 1506380400,
1506384000, 1506391200, 1506394800), class = c("POSIXct", "POSIXt"
), tzone = ""), end_station = c("B", "C", "A", "C", "B", "B",
"A", "C", "B", "C"), endtime = structure(c(1506317400, 1506340800,
1506349800, 1506360600, 1506369600, 1506378600, 1506382200, 1506385800,
1506393000, 1506396600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("bikeid",
"start_station", "starttime", "end_station", "endtime"), row.names = c(NA,
-10L), class = "data.frame")
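This calculates the differences in the order they occur, as requested, but it does not append the result to the data.frame (df1 here is the test data frame from the question):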
lapply(df1$starttime[df1$start_station == "B"], function(x, et) difftime(et[x < et][1], x, units = "mins"), et = df1$endtime[df1$end_station == "B"])
[[1]]
Time difference of 510 mins
[[2]]
Time difference of 30 mins
[[3]]
Time difference of 210 mins
[[4]]
Time difference of NA mins
To calculate the mean time:
v1 <- sapply(df1$starttime[df1$start_station == "B"], function(x, et) difftime(et[x < et][1], x, units = "mins"), et = df1$endtime[df1$end_station == "B"])
mean(v1, na.rm = TRUE)
[1] 250
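If the gaps should sit next to the rows they belong to, one option is to write them into a new column. The following is only a sketch under assumptions: the sample data is rebuilt from the printed table with the time zone fixed to UTC, and the column name mins_away and the helper at() are made up for illustration.

```r
# Rebuild the sample trajectory from the printed table (UTC is an assumption).
at <- function(x) as.POSIXct(paste("2017-09-25", x), tz = "UTC")
df1 <- data.frame(
  start_station = c("A", "B", "C", "A", "C", "B", "B", "A", "C", "B"),
  starttime = at(c("01:00:00", "07:30:00", "10:00:00", "13:00:00", "15:30:00",
                   "18:00:00", "19:00:00", "20:00:00", "22:00:00", "23:00:00")),
  end_station = c("B", "C", "A", "C", "B", "B", "A", "C", "B", "C"),
  stringsAsFactors = FALSE
)
df1$endtime <- df1$starttime + 30 * 60  # every trip in the sample lasts 30 minutes

from_b <- df1$start_station == "B"          # rows where the bike leaves B
et <- df1$endtime[df1$end_station == "B"]   # times at which the bike arrives at B

# mins_away (hypothetical column name): minutes until the next arrival at B,
# NA for rows that do not start at B or that have no later arrival at B.
df1$mins_away <- NA_real_
df1$mins_away[from_b] <- sapply(df1$starttime[from_b], function(x, et) {
  as.numeric(difftime(et[x < et][1], x, units = "mins"))
}, et = et)

print(df1[from_b, c("starttime", "mins_away")])
print(mean(df1$mins_away, na.rm = TRUE))
```

Storing plain numbers instead of difftime objects keeps the column easy to average, at the cost of carrying the unit only in the column name.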
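Thanks, this approach works. Could you briefly explain how 'function(x, et)' works?
'lapply' allows additional arguments to be passed on to the function. The value of 'x' is each 'starttime' in turn, while 'et' is an additional argument supplied after the function. This way the vector is defined only once but can be used twice inside the function. – manotheshark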
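The argument passing described in this comment can be tried on plain numbers. This is a toy sketch: the vectors starts and ends and the function name first_after are invented for illustration and are not part of the question's data.

```r
# first_after (hypothetical name): the first element of et that is greater than x.
first_after <- function(x, et) et[x < et][1]

starts <- c(1, 5, 9, 25)   # plays the role of the start times
ends   <- c(3, 8, 20)      # plays the role of the end times

# sapply calls first_after once per element of starts (that element becomes x);
# et = ends is passed through unchanged to every call.
res <- sapply(starts, first_after, et = ends)
print(res)   # 3 8 20 NA
```

The last value is NA because no element of ends is greater than 25, which mirrors the NA for the final departure from B in the answer.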
Another possibility:
library(data.table)
d <- setDT(test)[ , {
start = starttime[start_station == "B"]
end = endtime[end_station == "B"]
.(start = start, end = end, duration = difftime(end, start, units = "min"))
}
, by = .(trip = cumsum(start_station == "B"))]
d
# trip start end duration
# 1: 0 <NA> 2017-09-25 01:30:00 NA mins
# 2: 1 2017-09-25 07:30:00 2017-09-25 16:00:00 510 mins
# 3: 2 2017-09-25 18:00:00 2017-09-25 18:30:00 30 mins
# 4: 3 2017-09-25 19:00:00 2017-09-25 22:30:00 210 mins
# 5: 4 2017-09-25 23:00:00 <NA> NA mins
d[ , mean(duration, na.rm = TRUE)]
# Time difference of 250 mins
# or
d[ , mean(as.integer(duration), na.rm = TRUE)]
# [1] 250
The data is grouped by a counter that increases by 1 each time the bike starts from "B" (by = cumsum(start_station == "B")).
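To see what that counter looks like, here is a small base R sketch using only the start_station column of the sample data. The variable names trip and rows_by_trip are chosen for illustration.

```r
start_station <- c("A", "B", "C", "A", "C", "B", "B", "A", "C", "B")

# TRUE counts as 1, so the running sum steps up on every departure from B.
trip <- cumsum(start_station == "B")
print(trip)   # 0 1 1 1 1 2 3 3 3 4

# Row numbers that end up in each group.
rows_by_trip <- split(seq_along(start_station), trip)
print(rows_by_trip)
```

Each group from 1 onward begins with a departure from B and, in this sample, contains at most one arrival at B, which is why one start and one end per group are enough to compute the duration. Group 0 holds the rows before the first departure from B, hence the NA start in the output above.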
A first step would be converting to long format, like 'library(data.table); mtest = melt(setDT(test), id = "bikeid", meas = patterns("_station", "time"), variable.name = "event", value.name = c("station", "time")); mtest[.(factor(1:2), c("start", "end")), on = .(event), event := i.V2]; setkey(mtest, bikeid, time)', but I'm not sure of the best way after that. – Frank