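Find the interval index of values in a vector when the intervals may overlap
Problem description:
I want to find the indices of the values in a vector that fall within intervals defined by a vector of end values together with either 1) a "look-back" value interval or 2) the previous N values.
Suppose I have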
x <- c(1, 3, 4, 5, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18)  # the vector of interest
v_end <- c(5, 7, 15)  # the end values
l <- 3  # look-back value interval
N <- 3  # number of values to look back
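What I want is the second and third columns of the output below (i for the look-back interval l, n for the previous N values).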
       x i n
 [1,]  1 0 1
 [2,]  3 1 1
 [3,]  4 1 1
 [4,]  5 1 1
 [5,]  7 1 1
 [6,]  8 0 0
 [7,]  9 0 0
 [8,] 10 0 1
 [9,] 13 1 1
[10,] 14 1 1
[11,] 15 1 1
[12,] 16 0 0
[13,] 17 0 0
[14,] 18 0 0
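Note that v_end and l give three intervals: [2,5], [4,7] and [12,15]. [2,5] and [4,7] overlap, so together they are effectively [2,7]. Likewise, v_end and N give three intervals: [1,5], [3,7] and [10,15], which again overlap.
The task is similar to what findInterval {base} does, but it cannot be solved by that function as is.
Answer:
Assuming "v_end" is sorted (and, for the "N" case, "x" as well), the intervals for the "l" case are: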
ints = cbind(start = v_end - l, end = v_end)
ints
# start end
#[1,] 2 5
#[2,] 4 7
#[3,] 12 15
Intervals that overlap can be grouped with:
overlap_groups = cumsum(c(TRUE, ints[-nrow(ints), "end"] < ints[-1, "start"]))
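To make the grouping step concrete, here is a small check on the example data. It is only a sketch: the intermediate name new_group is introduced here for illustration and is not part of the answer, and the inputs are rebuilt so the snippet runs on its own.

```r
v_end <- c(5, 7, 15)
l <- 3
ints <- cbind(start = v_end - l, end = v_end)

# TRUE marks an interval that starts strictly after the previous one ends,
# i.e. the first interval of a new group; cumsum() turns the marks into ids.
new_group <- c(TRUE, ints[-nrow(ints), "end"] < ints[-1, "start"])
overlap_groups <- cumsum(new_group)
overlap_groups
# [1] 1 1 2
```

Because the comparison is strict, two intervals that merely share an endpoint end up in the same group. The adjacent comparison is enough here because both the starts and the ends are in increasing order.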
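These groups can then be used to reduce the overlapping intervals:
# index of the last interval in each group
group_end = cumsum(rle(overlap_groups)$lengths)
# index of the first interval in each group
group_start = c(1L, group_end[-length(group_end)] + 1L)
# a merged interval runs from the first start to the last end of its group
ints2 = cbind(start = ints[group_start, "start"], end = ints[group_end, "end"])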
ints2
# start end
#[1,] 2 7
#[2,] 12 15
Then, using findInterval:
istart = findInterval(x, ints2[, "start"])
iend = findInterval(x, ints2[, "end"], left.open = TRUE)
i = as.integer((istart - iend) == 1L)
i
# [1] 0 1 1 1 1 0 0 0 1 1 1 0 0 0
For the "N" case, start from:
ints = cbind(start = x[match(v_end, x) - N], end = v_end)
ints
# start end
#[1,] 1 5
#[2,] 3 7
#[3,] 10 15
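The indexing above relies on every element of v_end occurring in x with at least N values before it. Otherwise match() returns NA, a zero index silently drops an element, and a negative index removes elements or raises an error. A guarded variant might look like the sketch below. Clamping to the first element of x is an assumption about the desired behaviour, not something the question specifies, and the names pos and start_pos are hypothetical.

```r
x <- c(1, 3, 4, 5, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18)
v_end <- c(3, 15)  # modified example: 3 has only one predecessor in x
N <- 3

pos <- match(v_end, x)
stopifnot(!anyNA(pos))          # every end value must occur in x
start_pos <- pmax(pos - N, 1L)  # assumption: look back as far as x allows
ints <- cbind(start = x[start_pos], end = v_end)
ints
```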
and, following the steps above, we get:
#.....
n = as.integer((istart - iend) == 1L)
n
# [1] 1 1 1 1 1 0 0 1 1 1 1 0 0 0
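Since both cases go through the same merge-then-lookup steps, they can be wrapped in one function. The following is a sketch only: in_merged_intervals is a hypothetical name, and it assumes that start and end are both sorted in increasing order, as they are in the example.

```r
# Hypothetical helper: flag elements of `x` that lie in the union of the
# closed intervals [start[k], end[k]]. Assumes sorted `start` and `end`.
in_merged_intervals <- function(x, start, end) {
  n_int <- length(start)
  # a new group begins where an interval starts after the previous one ends
  grp <- cumsum(c(TRUE, end[-n_int] < start[-1]))
  grp_end <- cumsum(rle(grp)$lengths)
  grp_start <- c(1L, grp_end[-length(grp_end)] + 1L)
  m_start <- start[grp_start]
  m_end <- end[grp_end]
  # x is inside a merged interval when it has reached one more start than end
  istart <- findInterval(x, m_start)
  iend <- findInterval(x, m_end, left.open = TRUE)
  as.integer(istart - iend == 1L)
}

x <- c(1, 3, 4, 5, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18)
v_end <- c(5, 7, 15)
l <- 3
N <- 3

i <- in_merged_intervals(x, v_end - l, v_end)
n <- in_merged_intervals(x, x[match(v_end, x) - N], v_end)
cbind(x, i, n)
```

The last line should reproduce the three-column matrix requested in the question.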
In general, the "IRanges" package is a convenient tool for this kind of operation and makes the approach simple and straightforward:
library(IRanges)
xrng = IRanges(x, x)
i = as.integer(overlapsAny(xrng, reduce(IRanges(v_end - l, v_end), min.gapwidth = 0)))
i
# [1] 0 1 1 1 1 0 0 0 1 1 1 0 0 0
n = as.integer(overlapsAny(xrng, reduce(IRanges(x[match(v_end, x) - N], v_end), min.gapwidth = 0)))
n
# [1] 1 1 1 1 1 0 0 1 1 1 1 0 0 0
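For small inputs, the results can also be cross-checked in base R without merging the intervals first, by testing each value against every interval. This is an alternative sketch rather than the method used above. It builds a length(x) by length(v_end) logical matrix, so it is only reasonable when both vectors are short. The names start_l, start_n and in_any are introduced here for illustration.

```r
x <- c(1, 3, 4, 5, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18)
v_end <- c(5, 7, 15)
l <- 3
N <- 3

# TRUE for each x[j] that lies in at least one closed interval [start[k], end[k]]
in_any <- function(x, start, end) {
  inside <- outer(x, start, ">=") & outer(x, end, "<=")
  as.integer(rowSums(inside) > 0)
}

start_l <- v_end - l
start_n <- x[match(v_end, x) - N]

i <- in_any(x, start_l, v_end)
n <- in_any(x, start_n, v_end)
i
# [1] 0 1 1 1 1 0 0 0 1 1 1 0 0 0
n
# [1] 1 1 1 1 1 0 0 1 1 1 1 0 0 0
```

Overlaps need no special handling here because a value only has to fall in one of the intervals, and the intervals do not have to be sorted.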