BigQuery中重叠间隔的数量
问题描述:
给定一个间隔表,我可以有效地查询每个间隔开始时的当前打开间隔数(包括当前间隔本身)吗?BigQuery中重叠间隔的数量
例如,下表给出:
start_time end_time 1 10 2 5 3 4 5 6 7 11 19 20
我想下面的输出:
start_time count 1 1 2 2 3 3 5 3 7 2 19 1
在小数据集,我可以对自己加入该数据集解决这个问题:
WITH intervals AS (
SELECT 1 AS start, 10 AS end UNION ALL
SELECT 2, 5 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 7, 11 UNION ALL
SELECT 19, 20
)
SELECT
a.start_time,
count(*)
FROM
intervals a CROSS JOIN intervals b
WHERE
a.start_time >= b.start_time AND
a.start_time <= b.end_time
GROUP BY a.start_time
ORDER BY a.start_time
对于大型数据集,CROSS JOIN既不切实际也不必要,因为ny给出答案只取决于少数前面的区间(当按start_time
排序时)。事实上,在我拥有的数据集中,它超时。有没有更好的方法来实现这一目标?
答
... CROSS JOIN既不切实际也不必要...
有没有更好的方法来实现这个目标?
请尝试下面的BigQuery标准SQL。没有加入参与
#standardSQL
SELECT
start_time,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
SELECT
start_time,
ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
FROM intervals
)
-- ORDER BY start_time
您可以测试/使用下面的例子发挥它从你的问题
#standardSQL
WITH intervals AS (
SELECT 1 AS start_time, 10 AS end_time UNION ALL
SELECT 2, 5 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 7, 11 UNION ALL
SELECT 19, 20
)
SELECT
start_time,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
SELECT
start_time,
ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
FROM intervals
)
-- ORDER BY start_time
+0
@BrandonDuRette - 你有机会尝试吗? –
u能解释输出虚拟数据? – Teja
输出是从输入开始的每个时间间隔的开始时间以及在该时间间隔的开始时间的开放时间间隔(开始时间 =那个时间)的计数。 –