BIGQUERY选择不重复的记录
问题描述:
考虑下表(简本):BIGQUERY选择不重复的记录
id int,
amount decimal,
transaction_no,
location_id int,
created_at datetime
上面的架构用于存储POS收据餐厅。现在,此表有时包含同一日期的收据,相同transaction_no at same location_id。
在这种情况下,我想要做的就是最后收到的是为了通过created_at递减 LOCATION_ID & transaction_no。
在MySQL中,我使用下面的查询这让我最后一个(max(created_at)
收到一个LOCATION_ID & transaction_no:BigQuery中的
SELECT id, amount, transaction_no, location_id, created_at
FROM receipts r JOIN
(SELECT transaction_no, max(created_at) AS maxca
FROM receipts r
GROUP BY transaction_no
) t
ON r.transaction_no = t.transaction_no AND r.created_at = t.maxca
group by location_id;
但是当我运行相同的,我得到以下错误:
Query Failed Error: Shuffle reached broadcast limit for table __I0 (broadcasted at least 150393576 bytes). Consider using partitioned joins instead of broadcast joins . Job ID: circular-gist-812:job_A_CfsSKJICuRs07j7LHVbkqcpSg
任何想法如何使上述查询在BigQuery中工作?
答
SELECT id, amount, transaction_no, location_id, created_at
FROM (
SELECT
id, amount, transaction_no, location_id, created_at,
ROW_NUMBER() OVER(PARTITION BY transaction_no, location_id
ORDER BY created_at DESC) as last
FROM your_dataset.your_table
)
WHERE last = 1
这给了我扁平化的输出。有没有一种方法可以保持嵌套结构?我已经在[bigquery web ui](https://www.evernote.com/l/ACRfvDG5ERpJPLBZ9ooZpOOIb_WKFvhEs6wB/image.png)中取消选中了Flatten结果选项 – CuriousMind