How to identify and delete duplicate rows, keeping only the most recent

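Problem description:

I'm working in HeidiSQL and trying to figure out how to delete all duplicate rows except the most recent one. There are some slight differences between the "duplicates", but rows count as duplicates whenever more than one of them shares the same four values (namely UserID, ContactID, SMSID, and EventID). Within each such group I need to delete everything except the most recent row, as identified by CreatedDate.

The following query identifies these rows: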

SELECT a.UserID, a.ContactID, a.SMSID, a.EventID, CreatedDate 
FROM WhenToText a 
JOIN (SELECT UserID, ContactID, SMSID, EventID 
     FROM WhenToText 
     GROUP BY UserID, ContactID, SMSID, EventID 
     HAVING COUNT(*) > 1) b 
ON a.UserID = b.UserID 
AND a.ContactID = b.ContactID 
AND a.SMSID = b.SMSID 
AND a.EventID = b.EventID 
ORDER BY UserID, ContactID, SMSID, EventID, CreatedDate DESC 

However, I don't know how to delete the duplicates once I have identified them.

Here is some sample data:

[screenshot of sample data]

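When you say "according to the most recent row", do you mean that, where there are duplicates, you want to keep the most recent record? Can you show us some sample data? –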

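Yes, that's correct. I only want to keep the most recent of the duplicates, so I'm interested in the latest CreatedDate. I've added a screenshot of some sample data to the original post. Thanks again. – David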

Here is one approach:

DELETE w1 FROM WhenToText w1 
INNER JOIN 
(
    SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) AS MaxDate 
    FROM WhenToText 
    GROUP BY UserID, ContactID, SMSID, EventID 
) w2 
    ON w1.UserID = w2.UserID AND w1.ContactID = w2.ContactID AND w1.SMSID = w2.SMSID 
     AND w1.EventID = w2.EventID 
     AND w1.CreatedDate != w2.MaxDate 

This deletes every record whose CreatedDate is not the latest for its (UserID, ContactID, SMSID, EventID) group. Keep in mind that if several records share the latest CreatedDate, this can leave more than one record per group.

To check first which records the query would target for deletion, replace DELETE w1 FROM WhenToText w1 with SELECT w1.* FROM WhenToText w1.
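The preview described above can be written out in full. This is a minimal sketch that reuses the join from this answer unchanged and only swaps the leading clause; the ORDER BY is an addition for readability and is not part of the original answer.

```sql
-- Preview only: lists the rows the DELETE would remove, without changing the table.
SELECT w1.*
FROM WhenToText w1
INNER JOIN
(
    SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) AS MaxDate
    FROM WhenToText
    GROUP BY UserID, ContactID, SMSID, EventID
) w2
    ON w1.UserID = w2.UserID
   AND w1.ContactID = w2.ContactID
   AND w1.SMSID = w2.SMSID
   AND w1.EventID = w2.EventID
   AND w1.CreatedDate != w2.MaxDate
-- Added for readability: newest of the to-be-deleted rows first within each group.
ORDER BY w1.UserID, w1.ContactID, w1.SMSID, w1.EventID, w1.CreatedDate DESC;
```

If the result contains only rows you expect to lose, the same join can then be run as a DELETE.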

Here is a link to a SQL Fiddle that demonstrates how the query identifies the records to delete:

SQLFiddle
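The caveat about ties in this answer can be checked before deleting anything. The query below is a sketch, not part of the original answer: it lists the groups in which more than one row carries the latest CreatedDate. The alias TiedRows is a name made up for illustration.

```sql
-- Lists groups where several rows share the latest CreatedDate,
-- i.e. groups that would still hold more than one row after the DELETE.
SELECT w1.UserID, w1.ContactID, w1.SMSID, w1.EventID,
       w1.CreatedDate, COUNT(*) AS TiedRows   -- TiedRows: illustrative alias
FROM WhenToText w1
INNER JOIN
(
    SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) AS MaxDate
    FROM WhenToText
    GROUP BY UserID, ContactID, SMSID, EventID
) w2
    ON w1.UserID = w2.UserID
   AND w1.ContactID = w2.ContactID
   AND w1.SMSID = w2.SMSID
   AND w1.EventID = w2.EventID
   AND w1.CreatedDate = w2.MaxDate            -- only the newest rows of each group
GROUP BY w1.UserID, w1.ContactID, w1.SMSID, w1.EventID, w1.CreatedDate
HAVING COUNT(*) > 1;
```

An empty result means no group has a tie on its latest CreatedDate, so the DELETE would leave exactly one row per group.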

Thank you very much, this is exactly what I was looking for. – David

Here is a solution using DELETE ... FROM ... JOIN, with a full demo on your data.

SQL:

-- Data preparation 
create table WhenToText(UserID int, ContactID int, SMSID int, EventID int, CreatedDate datetime); 
insert into WhenToText values 
    (4, 25, 7934, 7407, '2016-02-10 00:00:11'), 
    (4, 25, 7934, 7407, '2016-02-09 00:00:12'), 
    (4, 29, 5132, 7407, '2016-02-10 00:00:11'), 
    (4, 29, 5132, 7407, '2016-02-09 00:00:12'), 
    (4, 31, 12944, 7405, '2016-02-10 07:03:02'), 
    (4, 31, 12944, 7405, '2016-02-10 05:03:02'), 
    (4, 146, 12908, 7405, '2016-02-10 06:52:02'), 
    (4, 146, 12908, 7405, '2016-02-10 04:52:02'), 
    (15, 63, 12964, 7401, '2016-02-10 03:42:04'), 
    (15, 63, 12964, 7401, '2016-02-10 03:41:04'), 
    (15, 64, 12326, 7401, '2016-02-07 03:01:03'), 
    (15, 64, 12326, 7401, '2016-02-07 03:00:03'); 
SELECT * FROM WhenToText; 

-- SQL needed 
DELETE a FROM 
    WhenToText a INNER JOIN 
    (
    SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) CreatedDate 
    FROM WhenToText 
    GROUP BY UserID, ContactID, SMSID, EventID 
    ) b 
    USING(UserID, ContactID, SMSID, EventID) 
WHERE 
    a.CreatedDate != b.CreatedDate; 

SELECT * FROM WhenToText; 

Output:

mysql> SELECT * FROM WhenToText;
+--------+-----------+-------+---------+---------------------+
| UserID | ContactID | SMSID | EventID | CreatedDate         |
+--------+-----------+-------+---------+---------------------+
|      4 |        25 |  7934 |    7407 | 2016-02-10 00:00:11 |
|      4 |        25 |  7934 |    7407 | 2016-02-09 00:00:12 |
|      4 |        29 |  5132 |    7407 | 2016-02-10 00:00:11 |
|      4 |        29 |  5132 |    7407 | 2016-02-09 00:00:12 |
|      4 |        31 | 12944 |    7405 | 2016-02-10 07:03:02 |
|      4 |        31 | 12944 |    7405 | 2016-02-10 05:03:02 |
|      4 |       146 | 12908 |    7405 | 2016-02-10 06:52:02 |
|      4 |       146 | 12908 |    7405 | 2016-02-10 04:52:02 |
|     15 |        63 | 12964 |    7401 | 2016-02-10 03:42:04 |
|     15 |        63 | 12964 |    7401 | 2016-02-10 03:41:04 |
|     15 |        64 | 12326 |    7401 | 2016-02-07 03:01:03 |
|     15 |        64 | 12326 |    7401 | 2016-02-07 03:00:03 |
+--------+-----------+-------+---------+---------------------+
12 rows in set (0.00 sec)

mysql>
mysql> -- SQL needed
mysql> DELETE a FROM
    ->     WhenToText a INNER JOIN
    ->     (
    ->     SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) CreatedDate
    ->     FROM WhenToText
    ->     GROUP BY UserID, ContactID, SMSID, EventID
    ->     ) b
    ->     USING(UserID, ContactID, SMSID, EventID)
    -> WHERE
    ->     a.CreatedDate != b.CreatedDate;
Query OK, 6 rows affected (0.00 sec)

mysql>
mysql> SELECT * FROM WhenToText;
+--------+-----------+-------+---------+---------------------+
| UserID | ContactID | SMSID | EventID | CreatedDate         |
+--------+-----------+-------+---------+---------------------+
|      4 |        25 |  7934 |    7407 | 2016-02-10 00:00:11 |
|      4 |        29 |  5132 |    7407 | 2016-02-10 00:00:11 |
|      4 |        31 | 12944 |    7405 | 2016-02-10 07:03:02 |
|      4 |       146 | 12908 |    7405 | 2016-02-10 06:52:02 |
|     15 |        63 | 12964 |    7401 | 2016-02-10 03:42:04 |
|     15 |        64 | 12326 |    7401 | 2016-02-07 03:01:03 |
+--------+-----------+-------+---------+---------------------+
6 rows in set (0.00 sec)
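One way to try the DELETE from this answer without committing to it is to wrap it in a transaction and roll back. This is only a sketch, and it assumes that WhenToText uses a transactional storage engine such as InnoDB, which the thread does not confirm. On a non-transactional engine such as MyISAM the ROLLBACK would not undo the delete.

```sql
-- Assumption: WhenToText is stored in a transactional engine (e.g. InnoDB).
START TRANSACTION;

DELETE a FROM
    WhenToText a INNER JOIN
    (
    SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) CreatedDate
    FROM WhenToText
    GROUP BY UserID, ContactID, SMSID, EventID
    ) b
    USING(UserID, ContactID, SMSID, EventID)
WHERE
    a.CreatedDate != b.CreatedDate;

-- Inspect what is left while the transaction is still open.
SELECT * FROM WhenToText;

-- Undo the delete; run COMMIT instead once the result looks right.
ROLLBACK;
```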

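This looks like a good approach. Is there a good way to test it before executing it? I tried running it as SELECT * FROM to get back all the rows that would be deleted, but couldn't get it to work. Any ideas? Thanks again! – David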

@David Updated the post based on your new data. Please try again. –

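Excellent, this works! Thanks for your help, much appreciated. Just curious: is there a way to automate the input for the CREATE TABLE step, or does it have to be done by hand? Ideally I'd like to be able to query directly and skip the manual entry. – David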
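One possible way to avoid typing test data by hand, assuming MySQL, is to clone the real table into a scratch table and run the DELETE against the clone. This is a sketch rather than anything proposed in the thread; the table name WhenToText_test is hypothetical.

```sql
-- WhenToText_test is a hypothetical scratch table name.
-- CREATE TABLE ... LIKE copies the column definitions and indexes, but no rows.
CREATE TABLE WhenToText_test LIKE WhenToText;

-- Copy the current rows into the scratch table.
INSERT INTO WhenToText_test SELECT * FROM WhenToText;

-- Run the de-duplication against the copy only.
DELETE a FROM
    WhenToText_test a INNER JOIN
    (
    SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) CreatedDate
    FROM WhenToText_test
    GROUP BY UserID, ContactID, SMSID, EventID
    ) b
    USING(UserID, ContactID, SMSID, EventID)
WHERE
    a.CreatedDate != b.CreatedDate;

-- Inspect the result, then discard the scratch table.
SELECT * FROM WhenToText_test;
DROP TABLE WhenToText_test;
```

The original table is never modified, so the experiment can be repeated as often as needed.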

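This should give you the solution you're looking for, given that CreatedDate is a date data type. It also rests on the assumption that the most recent row is the one with the latest CreatedDate.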

SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) AS CreatedDate 
FROM WhenToText 
GROUP BY 1, 2, 3, 4; 

With these values you can simply overwrite the WhenToText table. It would look something like this:

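-- Build an empty table with the same structure as WhenToText
CREATE TABLE tmp_table LIKE WhenToText; 

-- Keep one row per group, carrying the latest CreatedDate
-- (columns are listed explicitly so the insert does not rely on column order)
INSERT INTO tmp_table (UserID, ContactID, SMSID, EventID, CreatedDate) 
SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) AS CreatedDate 
FROM WhenToText 
GROUP BY 1, 2, 3, 4; 

-- Empty the original table, then reload it from the de-duplicated copy
TRUNCATE WhenToText; 

INSERT INTO WhenToText SELECT * FROM tmp_table; 

DROP TABLE tmp_table; 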
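Because TRUNCATE cannot be rolled back in MySQL, it may be worth pausing after tmp_table has been filled and checking it before going further. The query below is a sketch of such a check, not part of the original answer; the aliases RowsInTmp, GroupsInSource and g are names made up for illustration.

```sql
-- Run after filling tmp_table and before TRUNCATE.
-- The two numbers should match: one kept row per distinct group.
SELECT
    (SELECT COUNT(*) FROM tmp_table) AS RowsInTmp,          -- illustrative alias
    (SELECT COUNT(*)
     FROM (SELECT UserID, ContactID, SMSID, EventID
           FROM WhenToText
           GROUP BY UserID, ContactID, SMSID, EventID) g    -- one row per group
    ) AS GroupsInSource;                                    -- illustrative alias
```

If the two counts differ, something went wrong while filling tmp_table and the original table should be left alone.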

I just tried this, and it keeps saying I "can't group on CreatedDate". – David

Also, I tried a few tweaks, such as replacing GROUP BY with ORDER BY, and the query returned only a single row (i.e. the MAX(CreatedDate)). – David

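1st Q: @David I'm not sure how you're getting the GROUP BY error, since 1, 2, 3, 4 simply refer to the items (UserID, ContactID, SMSID, EventID), as long as they are the first 4 items in the SELECT statement. 2nd Q: You need the GROUP BY clause for the aggregate MAX function to work correctly. Using an aggregate with an ORDER BY clause instead of GROUP BY aggregates over the whole table and returns only one row. – TomDobbs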
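The difference described in the second point can be seen by running the aggregate both ways. This is a small sketch against the demo data from the earlier answer, before any rows are deleted.

```sql
-- Without GROUP BY the whole table is treated as a single group,
-- so MAX() collapses everything into one row.
SELECT MAX(CreatedDate) AS CreatedDate
FROM WhenToText;

-- With GROUP BY, MAX() is evaluated once per (UserID, ContactID, SMSID, EventID) group,
-- giving one row for each group.
SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) AS CreatedDate
FROM WhenToText
GROUP BY UserID, ContactID, SMSID, EventID;
```

On that twelve-row demo table the first query returns the single value 2016-02-10 07:03:02, while the second returns six rows, one per group.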