How to index all the unique words of a corpus using Perl

Problem description:

.i 1 
.t 
effici machineindepend procedur 
garbag collect variou list structur 
.w 
method return regist free 
list essenti part list process 
system. paper past solut recoveri 
problem review compar. new algorithm 
present offer signific advantag speed 
storag util. routin implement 
algorithm written list languag 
     insur degre 
machin independ. final applic 
algorithm number differ list structur 
appear literatur indic. 
.b 
cacm august 1967 
.a 
schorr h. 
wait w. m. 
.n 
ca670806 jb februari 27 1978 428 pm 
.x 
1024 4 1549 

1024 4 1549 

1050 4 1549 

.i 2 
.t 
comparison batch process instant turnaround 
.w 
studi program effort student 
    introductori program cours present 
    effect have instant turnaround minut 
oppos convent batch process 
turnaround time hour examin. 
item compar number comput 
run trip comput center program prepar 
time keypunch time debug time 
number run elaps time run 
    run problem. 
result influenc fact bonu point 
given complet program problem 
    specifi number run 
evid support instant batch. 
.b 
cacm august 1967 
.a 
smith l. b. 
.n 
ca670805 jb februari 27 1978 432 pm 
.x 
1550 4 1550 

1550 4 1550 

1304 5 1550 

1472 5 1550

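The text above is 2 documents, both already stopped and stemmed. A new document starts at .i (followed by a number). What I need to do is index the text between .t & .b, .b & .a, .a & .n, and .n & .x, and ignore all text between .x and the start of the next document, i.e. .i (followed by a number).

The contents of all documents are stored in a single file, called the "corpus". The words need to be indexed by the number of times they occur in the corpus and in each document, and possibly by their positions within the document.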

my $file = 'sometext.txt';
open FILE, '<', $file or die $!;
my @texts = <FILE>;
foreach my $text (@texts) {
    my @lines = split("\n", $text);
    foreach my $line (@lines) {
        my @words = split(" ", $line);
        foreach my $word (@words) {
            $word = trim($word);
            # \Q...\E quotes regex metacharacters such as the "." in "h."
            my $match = qr/\Q$word\E/i;

            # Re-read the whole corpus for every word (slow, but simple)
            open STFILE, '<', $file or die $!;
            my $count = 0;

            while (<STFILE>) {
                if ($_ =~ $match) {
                    my @mword = split /\s+/, $_;
                    for my $i (0..$#mword) {
                        if ($mword[$i] =~ $match) {
                            #print "match found on line $. word ", $i+1,"\n";
                            $count++;
                        }
                    }
                }
            }
            print "$word appears $count times \n";
            close(STFILE) or die "Couldn't close $file: $!\n\n";
        }
    }
}

close(FILE) or die "Couldn't close $file: $!\n\n";

# Strip leading and trailing whitespace
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

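The code above counts the occurrences of each word in the corpus. How can I change it so that it also counts the occurrences of each word in the individual documents?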

`++$count{$word}` – ikegami 2012-04-13 21:31:49

Answer

How about:

Edited to add a separate counter for each document:

#!/usr/bin/perl 
use strict; 
use warnings; 
use Data::Dumper; 

my $words; 
my $doc; 
my $file = 'path/to/file'; 
open my $fh, '<', $file or die "unable to open '$file' for reading:$!"; 
while(<$fh>) { 
    chomp; 
    $doc = $_ if /^\.i/; 
    next if (/^\.x\b/ .. /^\.i\b/); 
    next if /^\./; 
    my @words = split; 
    for(@words) { 
     $words->{$_}{$doc}++; 
    } 
} 
close $fh; 
print Dumper $words;

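The code above works fine for counting word frequencies, but I need to count the frequency of words in each document. For example: "apple" appears 3 times in doc1, 4 times in doc5, 7 times in doc8, and so on... – lolla 2012-04-14 15:10:51

@ArpitaDuppala: see my edit. – Toto 2012-04-15 09:01:48

I wrote the output of the code above to a file. But I want to use the values from the Dumper output; how do I extract/access them? – lolla 2012-04-17 03:18:26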
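
On reading the values back: the structure printed by Data::Dumper is a hash of hashes (word, then document, then count), so it can be walked directly in the same script instead of parsing the dump. The following is only a minimal sketch. The sample data reuses the "apple" example from the comment above, and the helper names `count_in_doc` and `count_in_corpus` are hypothetical, not part of the answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample shaped like the $words structure:
# word => { document => count }
my $words = {
    apple => { doc1 => 3, doc5 => 4, doc8 => 7 },
    list  => { doc1 => 5 },
};

# Count of one word in one document (0 if the pair was never seen).
# The exists checks avoid autovivifying new keys in the index.
sub count_in_doc {
    my ($index, $word, $doc) = @_;
    return 0 unless exists $index->{$word} && exists $index->{$word}{$doc};
    return $index->{$word}{$doc};
}

# Total count of a word across all documents.
sub count_in_corpus {
    my ($index, $word) = @_;
    my $total = 0;
    $total += $_ for values %{ $index->{$word} || {} };
    return $total;
}

# Walk the whole index: every word, then every document it occurs in.
for my $word (sort keys %$words) {
    for my $doc (sort keys %{ $words->{$word} }) {
        print "$word appears $words->{$word}{$doc} times in $doc\n";
    }
    print "$word appears ", count_in_corpus($words, $word), " times in the corpus\n";
}
```

If the counts must be reloaded by a different script later, a serialization module is probably a better fit than a Dumper text file, but that goes beyond what was asked here.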

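Answer

Use a hash whose values hold the current count for each word. Loop over all the lines and all the words. Use a dumb state machine (based on a flag variable) to ignore the text between .t and .b

If you run into trouble writing any of the above, please post the specific problem you are stuck on.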
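
The flag-based approach might look roughly like the sketch below. It rests on several assumptions: the skipped block is the one the question describes (from .x up to the next .i), the sample corpus is a shortened, hypothetical version of the question's data read from an in-memory string, and the names `%corpus_count`, `%doc_count` and `$skip` are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-document sample in the question's format.
my $corpus = <<'END';
.i 1
.t
garbag collect list
.w
list process system
.b
cacm august 1967
.x
1024 4 1549
.i 2
.t
batch process
.x
1550 4 1550
END

my (%corpus_count, %doc_count);
my $doc;        # id of the document currently being read
my $skip = 0;   # flag: true while inside a block that must be ignored

open my $fh, '<', \$corpus or die "unable to read sample: $!";
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^\.i\s+(\d+)/) {    # new document: remember id, clear flag
        $doc  = $1;
        $skip = 0;
        next;
    }
    if ($line =~ /^\.x\b/) {          # start of the block to ignore
        $skip = 1;
        next;
    }
    next if $skip or $line =~ /^\./;  # ignored text and other marker lines
    next unless defined $doc;         # text before the first .i
    for my $word (split ' ', $line) {
        $corpus_count{$word}++;       # count over the whole corpus
        $doc_count{$doc}{$word}++;    # count inside this document
    }
}
close $fh;

for my $word (sort keys %corpus_count) {
    print "$word: $corpus_count{$word} in corpus\n";
}
```

Changing which markers set and clear `$skip` is enough to ignore a different section, and a real run would open the corpus file instead of the in-memory string.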
