Perl: extracting text

Question:

I have been working on this for so long! I would appreciate your help...

This is what my document will look like:

<text> 
<text> command <+>= "stuff_i_need" <text> 
<text> 
<text> command <+>= stuff <text> 
<text> 
<text> command <+>= -stuff <text> 
<text>

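Anything in angle brackets is optional.
"stuff" can be anything (apple, orange, banana), but it is what I need to extract.
"command" is fixed.

My code so far: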

#!/usr/bin/env perl 

use warnings; 
use strict; 
use Text::Diff; 

# File Handlers 
open(my $ofh, '>in.txt'); 
open(my $ifh, '<out.txt'); 

while (<$ifh>) 
{ 
    # Read in a line 
    my $line = $_; 
    chomp $line; 

    # Extract stuff 
    my $extraction = $line; 

    if ($line =~ /command \+= /i) {   
     $extraction =~ s/.*"(.*)".*/$1/; 
     # Write to file 
     print $ofh "$extraction\n"; 
    } 
}

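You need to specify which operators are valid. In your code you only have '+=', but in your sample input you have both '+=' and '='. The same goes for 'command' vs 'command2'. When you want to extract a word, you need to specify the exact conditions. – TLP 2012-08-17 18:28:21

Thanks for the tip, @TLP. I made some changes; please let me know if that helps. – Ryan 2012-08-17 18:40:17

Answer

Based on the sample input: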

if ($line =~ /command\d*\s*\+?=\s*["-]?(\w+)"?/i) {  
    $extraction = $1; 
    print "$extraction\n"; 
}
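
To see how this pattern behaves, here is a small self-contained check against the three shapes of line in the question. The sample lines and the helper name extract_stuff are made up for illustration; they are assumptions, not part of the answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from the answer above.
# Returns the captured word, or undef if the line does not match.
sub extract_stuff {
    my ($line) = @_;
    return $line =~ /command\d*\s*\+?=\s*["-]?(\w+)"?/i ? $1 : undef;
}

# Illustrative lines modelled on the question's sample input.
my @lines = (
    'text command += "stuff_i_need" text',
    'text command = stuff text',
    'text command += -stuff text',
    'text without the keyword',
);

for my $line (@lines) {
    my $word = extract_stuff($line);
    print defined $word ? "$word\n" : "(no match)\n";
}
```

Because the capture is (\w+), the leading " or - and the closing " are left out of the result.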

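Answer

A few things:

For extraction, don't use substitution (that is, use m// and not s///). If you use a match, the parenthesized groups inside the match are returned as a list (and are also assigned to $1, $2, $3, and so on).
The =~ operator binds the variable you want to match against, so you want to match against $line rather than $extraction.
Your .* is too greedy and can prevent the match from working the way you want. "Greedy" means that .* grabs as much as it can: it consumes the rest of the line, including the trailing ", and only gives characters back when the rest of the pattern would otherwise fail.

You want to specify what the word could be. For example, if it consists of letters, match [a-zA-Z]: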

my ($extraction) = $line =~ /command \+= "([a-zA-Z]*)"/;

If it is a number, you want [0-9]:

my ($extraction) = $line =~ /command \+= "([0-9]*)"/;

If it could be anything except a double quote, use [^"], which means "anything but a double quote":

my ($extraction) = $line =~ /command \+= "([^"]*)"/;

It usually helps to match just what you are looking for rather than a blanket .*.
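
The difference between a greedy .* and a negated character class is easiest to see on a line that contains more than one quoted string. The line below is invented for illustration and does not come from the question's input; this is only a sketch of the behaviour.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Illustrative line with two quoted strings (an assumption, not real input).
my $line = 'text command += "first" and "second" text';

# Greedy: (.*) runs on to the LAST double quote on the line.
my ($greedy) = $line =~ /command \+= "(.*)"/;

# Negated class: ([^"]*) stops at the FIRST closing double quote.
my ($precise) = $line =~ /command \+= "([^"]*)"/;

print "greedy:  $greedy\n";
print "precise: $precise\n";
```

Both matches are done in list context, so each captured group is assigned directly to the variable on the left.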

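According to the sample input, the quotes around the matched word are not required. – newfurniturey 2012-08-17 18:38:14

No, they're not, but I only wanted to fix one thing at a time. If he can get a match, he can start tweaking and work from there. There are too many problems in this question to address in any one answer here. – zostay 2012-08-17 18:39:29

Answer

The following regex will help you: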

m{
    (?<= =)     # Preceded by an `=`
    \s*         # Match 0 or more whitespace characters
    (?:         # Do not capture
        ["\-]   # Match either a `"` or a `-` (/x does not ignore spaces inside [...])
    )?          # Match once or never
    (           # Capture
        [^"\s]+ # Match anything but a `"` or whitespace
    )
}x;
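
As a rough sketch of how this pattern could be applied, the snippet below stores it in a compiled regex and runs it over a few made-up lines. The helper name extract_after_equals and the sample lines are assumptions for illustration. Note that the pattern is not tied to the word command, so it picks up the value after any = on the line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The pattern from the answer, stored as a compiled regex.
my $re = qr{
    (?<= =)       # preceded by an `=`
    \s*
    (?: ["\-] )?  # optional leading `"` or `-`
    ( [^"\s]+ )   # capture up to the next `"` or whitespace
}x;

# Hypothetical helper: returns the captured word, or undef if nothing matches.
sub extract_after_equals {
    my ($line) = @_;
    return $line =~ $re ? $1 : undef;
}

# Illustrative lines modelled on the question's sample input.
for my $line (
    'text command += "stuff_i_need" text',
    'text command = stuff text',
    'text command += -stuff text',
    'other = value',
) {
    my $word = extract_after_equals($line);
    print defined $word ? "$word\n" : "(no match)\n";
}
```

If only command lines should match, the lookbehind could be replaced by an explicit command\s*\+?= prefix, as the other answers do.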

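Answer

The following one-liner extracts a word (a sequence of non-whitespace characters), optionally surrounded by quotes, that follows an equals sign optionally prefixed with a plus sign. It reads from in.txt and writes to out.txt.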

perl -lne 'push @a, $1 if /command\s*\+?=\s*("?\S+"?)/ }{ 
    print for @a' in.txt > out.txt

The full code, if you prefer script form, is:

BEGIN { $/ = "\n"; $\ = "\n"; } 
LINE: while (defined($_ = <ARGV>)) { 
    chomp $_; 
    push @a, $1 if /command\s*\+?=\s*("?\S+"?)/; 
} 
{ 
    print $_ foreach (@a); 
}

Courtesy of the Deparse feature of the O module.

Answer

A lightweight solution.

#!/usr/bin/env perl 
use warnings; 
use strict; 

open my $ifh, '<', 'in.txt'  or die "Cannot open in.txt: $!";
open my $ofh, '>', 'out.txt' or die "Cannot open out.txt: $!";

while (<$ifh>) 
{ 
    if (/ 
     \s command\s\+?=\s 
     (?:-|("))?  # The word can be preceded by an optional - or " 
     (\w+) 
     (?(1)\1)\s+ # If the word is preceded by a " it must also end
         # with a "
     /x) 
    { 
     print $ofh $2."\n"; 
    } 
}
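
The conditional (?(1)\1) is the interesting part of this answer: it requires a closing " only when an opening " was captured. The sketch below exercises that on a few invented lines; the helper name extract_word and the sample lines are assumptions, not part of the answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same pattern as in the answer, stored as a compiled regex.
my $re = qr/
    \s command \s \+? = \s
    (?: - | (") )?   # optional leading - or " (the quote becomes group 1)
    (\w+)            # the word, group 2
    (?(1)\1)         # if group 1 matched, require the same quote here
    \s+
/x;

# Hypothetical helper: returns the word, or undef if the line does not match.
sub extract_word {
    my ($line) = @_;
    return $line =~ $re ? $2 : undef;
}

# Illustrative lines; the last one has an unbalanced quote on purpose.
for my $line (
    'text command += "stuff_i_need" text',
    'text command = stuff text',
    'text command += -stuff text',
    'text command += "stuff text',
) {
    my $word = extract_word($line);
    print defined $word ? "$word\n" : "(no match)\n";
}
```

The line with the unbalanced quote is rejected, which is the point of the conditional: a plain "? at the end would have let it through.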
