Removing duplicates from multiple text files in Perl?

Question:

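I'm new to this site and need help removing duplicate entries from multiple text files (in a loop). I tried the code below, but it does not remove duplicates from multiple files; it only works for a single file.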

Code:

my $file = "$Log_dir/File_listing.txt"; 
my $outfile = "$Log_dir/Remove_duplicate.txt";; 

open (IN, "<$file") or die "Couldn't open input file: $!"; 
open (OUT, ">$outfile") or die "Couldn't open output file: $!"; 
my %seen =(); 
{ 
    my @ARGV = ($file); 
    # local $^I = '.bac'; 
    while(<IN>){ 
    print OUT $seen{$_}++; 
    next if $seen{$_} > 1; 
    print OUT ; 
    } 
}

Thanks, Art

Try 'next if $seen{$_} > 0;'

You haven't opened multiple files; there is only one input file name, so how can you expect this to affect multiple files? – TLP

Are you looking for duplicates across all files, or only within each individual file? – TLP

Answer

I take it your File_listing.txt contains lines, some of which occur more than once? If so, just use the bash shell:

sort --unique <File_listing.txt >Remove_duplicate.txt

Or, if you prefer Perl:

perl -lne '$seen{$_}++ and next or print;' <File_listing.txt >Remove_duplicate.txt

Answer

Errors in your script:

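You overwrite (a new lexical copy of) @ARGV with $file, so it will never contain any other file arguments.
...which does not matter anyway, because you open the filehandle before assigning to @ARGV, and you never loop over the arguments; the { ... } block around the code serves no purpose.
%seen will hold deduplication data for every file you open, unless you reset it.
You print the count $seen{$_} to the output file, which I am sure you do not want.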

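You could rely on the diamond operator's implicit open of the @ARGV arguments, but since you (probably) need to assign a proper output file name for each new file, that would be an unnecessary complication in a solution like this one.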
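For comparison, here is a rough sketch of what the diamond-operator version might look like. It is only an illustration of the complication mentioned above, not a recommendation: the sub name dedupe_each_with_diamond and the ".deduped" suffix are assumptions made for this example. It relies on $ARGV holding the name of the file currently being read and on eof (without parentheses) being true at the end of each individual file.

```perl
use strict;
use warnings;

# Hypothetical helper: per-file deduplication driven by the diamond operator.
sub dedupe_each_with_diamond {
    local @ARGV = @_;        # <> reads the files named in @ARGV
    return unless @ARGV;     # with no files, <> would read STDIN instead

    my (%seen, $outfh);
    while (my $line = <>) {
        if (!defined $outfh) {
            # First line of a new input file: $ARGV is its name.
            open $outfh, ">", "$ARGV.deduped" or die "$ARGV.deduped: $!";
        }
        print {$outfh} $line unless $seen{$line}++;
        if (eof) {
            # End of the current file only, not of all input.
            close $outfh or die "$ARGV.deduped: $!";
            undef $outfh;
            %seen = ();      # reset so each file is deduplicated on its own
        }
    }
    return;
}

dedupe_each_with_diamond(@ARGV);
```

The extra bookkeeping (tracking when a new file starts, closing the output handle, resetting %seen) is exactly the overhead an explicit loop over file names avoids. Note also that an empty input file produces no output file in this sketch.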

use strict; 
use warnings;      # always use these 

for my $file (@ARGV) {    # loop over all file names 
    my $out = "$file.deduped";  # create output file name 
    open my $infh, "<", $file or die "$file: $!"; 
    open my $outfh, ">", $out or die "$out: $!"; 
    my %seen; 
    while (<$infh>) { 
        print $outfh $_ if !$seen{$_}++;  # print only if the line has not been seen before
    } 
}

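Note that declaring %seen as a lexical variable inside the loop makes the script check for duplicates within each individual file. If you move that variable outside the for loop, you will check for duplicates across all files. I'm not sure which one you want.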
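A minimal sketch of the second variant, with %seen declared once so that a line already written for an earlier file is skipped in later files. The sub name dedupe_across_files is an assumption made for this example; the ".deduped" suffix follows the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: deduplicate across all files, first occurrence wins.
sub dedupe_across_files {
    my @files = @_;
    my %seen;                          # declared once, shared by every file
    for my $file (@files) {
        my $out = "$file.deduped";     # output name derived from the input name
        open my $infh,  "<", $file or die "$file: $!";
        open my $outfh, ">", $out  or die "$out: $!";
        while (my $line = <$infh>) {
            print {$outfh} $line unless $seen{$line}++;
        }
        close $outfh or die "$out: $!";
    }
    return;
}

dedupe_across_files(@ARGV);
```

With this version the order of the file arguments matters: a line is kept in whichever file mentions it first and dropped from all later ones.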

Please edit: change !$seen{$_] to !$seen{$_}. I tried to change it myself, but an edit has to change at least 6 characters :) – Tony
